In an earlier note in this series on automatic language identification, we introduced how to detect language using the gcld3 library. gcld3 is designed to run in the Chrome browser, is written in C++, is based on a neural network model, and supports over 100 languages/scripts.
In this note, we introduce another language identification library called LangDetect. It uses a Naive Bayes classifier over character n-grams to detect the language. The classifier was trained on Wikipedia data for 49 languages. It was originally implemented in Java, and a Python port was released in 2014.
Introduction
Language detection is one of the crucial tasks in NLP, and it becomes more challenging when the same script is used to write different languages. In those cases, matching each word against a dictionary of each language does not work. Moreover, it requires additional knowledge of scripts (or Unicode) and languages. For example:
- Dari, Kashmiri, etc., are different languages, but they are written in the Perso-Arabic script.
- English, French, German, and Italian are different languages, but they are written in the Roman (Latin) script.
- Hindi, Marathi, Nepali, etc., are different languages, but they are written in the Devanagari script.
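The point of the list above can be checked with a small sketch that uses only Python's standard library. Python's unicodedata module does not expose the Unicode script property directly, so this sketch uses the first word of each character's Unicode name as a rough proxy for its script. The function name dominant_script and the sample sentences are illustrative assumptions, not part of langdetect.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most frequent script-like name prefix among the letters of text."""
    scripts = Counter()
    for ch in text:
        if ch.isalpha():
            # Unicode names start with the script, e.g. "DEVANAGARI LETTER NA".
            scripts[unicodedata.name(ch, "UNKNOWN").split(" ")[0]] += 1
    return scripts.most_common(1)[0][0] if scripts else None


# Hypothetical sample sentences: two languages per script.
samples = {
    "Hindi": "नमस्ते दुनिया",
    "Marathi": "माझे नाव राहुल आहे",
    "English": "hello world",
    "French": "bonjour le monde",
}

for language, sentence in samples.items():
    print(language, dominant_script(sentence))
```

Hindi and Marathi map to the same script label, as do English and French, so the script alone cannot separate them. This is why a statistical model over character sequences is needed.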
At a very high level, LangDetect uses a Naive Bayes classifier with character n-grams. During training, it estimates the prior probability of each language and the likelihood of each character n-gram given a language. At prediction time, it combines them into posterior probabilities over the candidate languages.
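To make the idea concrete, here is a minimal sketch of a character n-gram Naive Bayes classifier. It is not LangDetect's actual implementation: the class name ToyNgramNB, the add-one smoothing, the choice to ignore n-grams never seen in training, and the four training sentences are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, max_n=3):
    """Yield every character n-gram of length 1..max_n from space-padded text."""
    padded = f" {text.lower()} "
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            yield padded[i:i + n]


class ToyNgramNB:
    """Toy Naive Bayes over character n-grams with add-one smoothing."""

    def __init__(self, max_n=3):
        self.max_n = max_n
        self.ngram_counts = defaultdict(Counter)  # language -> n-gram frequencies
        self.doc_counts = Counter()               # language -> number of training texts
        self.vocab = set()                        # all n-grams seen in training

    def fit(self, samples):
        for lang, text in samples:
            grams = list(char_ngrams(text, self.max_n))
            self.doc_counts[lang] += 1
            self.ngram_counts[lang].update(grams)
            self.vocab.update(grams)

    def predict_proba(self, text):
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for lang, n_docs in self.doc_counts.items():
            counts = self.ngram_counts[lang]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for gram in char_ngrams(text, self.max_n):
                if gram in self.vocab:  # n-grams unseen in training are ignored
                    score += math.log((counts[gram] + 1) / denom)  # log likelihood
            log_scores[lang] = score
        # Normalise the log scores into posterior probabilities.
        top = max(log_scores.values())
        weights = {lang: math.exp(s - top) for lang, s in log_scores.items()}
        total = sum(weights.values())
        return {lang: w / total for lang, w in weights.items()}


# Hypothetical training data; a real model needs far more text per language.
TRAINING_SAMPLES = [
    ("en", "the quick brown fox jumps over the lazy dog"),
    ("en", "this is a short english sentence about the weather"),
    ("fr", "le renard brun saute par dessus le chien paresseux"),
    ("fr", "ceci est une courte phrase en français sur la météo"),
]

model = ToyNgramNB()
model.fit(TRAINING_SAMPLES)
posterior = model.predict_proba("the dog is over there")
print(max(posterior, key=posterior.get))
```

The prior comes from how many training texts each language has, and the likelihood from smoothed n-gram frequencies. Working in log space avoids numeric underflow when many small probabilities are multiplied.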
Environment Setup
We will use the langdetect Python library to test and evaluate automatic language identification on given input texts. First, we will create a conda virtual environment. We assume that a Python distribution is already installed on your computer. This could be Miniconda, Anaconda, or native Python; note that the conda commands below require Miniconda or Anaconda. If none is installed, download and install a distribution from its official website first.
Creation of Conda Virtual Environment
You can run the shell commands below in your operating system's terminal, in the same sequence, without root privileges.
    # Create Environment
    conda create -n venv_langdetect python=3.8
    # Configure shell for conda
    conda init bash
    # Activate Environment
    conda activate venv_langdetect
In the above code, we create a conda virtual environment with Python 3.8 and then activate it.
Installation of Packages
The script below installs langdetect from the conda-forge channel. It also installs pycountry, which is helpful for getting language names from ISO 639 language codes.
    conda install -c conda-forge langdetect
    pip install 'pycountry==22.1.10'
Python code for LangDetect
There are two classes, LID and LangDetectBasedLID. LID defines the standard methods needed for any language detection operation. The LangDetectBasedLID class inherits from LID and overrides the detect function. It is designed this way because the input parameters and logic of the detect function may differ between language identification libraries.
Let us assume that the code below is stored in the lid.py file.
    from langdetect import detect_langs
    from pycountry import languages


    class LID:
        """Common interface for language identification backends."""

        def __init__(self):
            self.lang = None
            self.score = None
            self.lang_name = None

        def get_lang_code_iso639(self):
            return self.lang

        def get_confidence_score(self):
            return self.score

        def get_lang_name(self):
            return self.lang_name

        def detect(self, text) -> bool:
            # Each backend overrides this method.
            raise NotImplementedError


    class LangDetectBasedLID(LID):
        def detect(self, text) -> bool:
            try:
                results = detect_langs(text)
                if not results:
                    return False
                # Candidates are sorted by probability; take the best one.
                best = results[0]
                self.lang = best.lang
                self.score = best.prob
                # Drop region suffixes such as "zh-cn" before the ISO 639-1 lookup.
                language = languages.get(alpha_2=self.lang.split("-")[0])
                self.lang_name = language.name if language else None
                return True
            except Exception as ex:
                template = "An exception of type {0} occurred. Arguments:\n{1!r}"
                print(template.format(type(ex).__name__, ex.args))
                return False
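The benefit of this base-class design is that calling code depends only on the shared interface, so a different backend can be swapped in without changing the caller. The sketch below illustrates this with standard-library code only. BaseLID, AsciiRuleLID, and report are hypothetical names, and the ASCII rule is a deliberately crude toy backend, not a real detector.

```python
from abc import ABC, abstractmethod


class BaseLID(ABC):
    """Hypothetical stand-in for the LID base class: shared state only."""

    def __init__(self):
        self.lang = None
        self.score = None

    @abstractmethod
    def detect(self, text: str) -> bool:
        """Backend-specific detection; returns True on success."""


class AsciiRuleLID(BaseLID):
    """Toy backend: labels non-empty pure-ASCII text as English."""

    def detect(self, text: str) -> bool:
        if text and text.isascii():
            self.lang, self.score = "en", 1.0
            return True
        return False


def report(detector: BaseLID, text: str) -> str:
    """Caller code that works with any backend implementing detect()."""
    if detector.detect(text):
        return f"{detector.lang} ({detector.score:.2f})"
    return "unknown"


print(report(AsciiRuleLID(), "hello world"))
```

Marking detect as abstract also makes Python refuse to instantiate the base class directly, which catches a missing override early.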
Language Detection
Let us assume that the code below is stored in the main.py file, alongside lid.py, which contains the classes defined above.
    from lid import LangDetectBasedLID

    script = "भगवंत मान ने पंजाब के 17वें मुख्यमंत्री पद की शपथ ली. पंजाब विधानसभा चुनाव में आम आदमी पार्टी को प्रचंड जीत मिली है. भगवंत मान का शपथ ग्रहण भगत सिंह के गांव खटकर कलां में हो रहा है. बाकी मंत्रियों का शपथग्रहण बाद में होगा. "

    obj = LangDetectBasedLID()
    if obj.detect(script):
        print("Confidence Score:", obj.get_confidence_score())
        print("Language Code:", obj.get_lang_code_iso639())
        print("Language full Name:", obj.get_lang_name())
Outputs
    Confidence Score: 0.9999966848043597
    Language Code: hi
    Language full Name: Hindi
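The confidence score is the probability that the classifier assigns to the detected language. Its value lies between 0 and 1, and higher values indicate greater confidence in the prediction. Because langdetect's algorithm is non-deterministic, the exact score can vary slightly between runs.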
We can test texts in different natural languages or scripts using the above code. The whole Python source, which uses the langdetect library for automatic language identification, is uploaded to the GitHub repository.
To execute the above code directly in a Jupyter notebook, one can follow the Kaggle Notebook.
References
- https://pypi.org/project/langdetect/
- https://www.kaggle.com/code/notepub/notebookc28a4a9641
- https://github.com/notepub-io/language-identification-in-texts/tree/main/langdetect
CITE THIS AS:
“Automatic Language Identification in Texts – LangDetect” From NotePub.io – Publish & Share Note! https://notepub.io/notes/artificial-intelligence/speech-and-language-processing/language-identification/lid-langdetect/