In an earlier note of this series on automatic language identification, we introduced language detection with the langid tool, which uses a naive Bayes classifier with a multinomial event model over a mixture of character n-grams and is trained on 97 languages. It also provides tools for model building, training, tokenization, and so on, which are helpful for building a model for a specific language. In addition, it is fast, insensitive to domain-specific features, and easy to install and use.
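As a quick refresher on that earlier approach, the sketch below shows minimal langid usage (assuming the langid package is installed from PyPI; the score returned by classify() is langid's internal confidence value, not a calibrated probability):

# Minimal recap of the langid-based approach from the earlier note.
# Assumes: pip install langid
import langid

# classify() returns an ISO 639-1 code and a confidence score.
lang, score = langid.classify("This is a short English sentence.")
print(lang, score)  # e.g. en and an unnormalized score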
Introduction
In this note, we introduce another language identification library called Polyglot. In dictionary terms, a polyglot is a person who can speak, read, or write in several languages.
Polyglot is a natural language pipeline that supports massive multilingual applications. It supports the following features:
- Tokenization (165 Languages)
- Language detection (196 Languages)
- Named Entity Recognition (40 Languages)
- Part of Speech Tagging (16 Languages)
- Sentiment Analysis (136 Languages)
- Word Embeddings (137 Languages)
- Morphological analysis (135 Languages)
- Transliteration (69 Languages)
However, our interest is in language detection. Internally, Polyglot uses the pycld2 library, which in turn depends on the cld2 library, to detect the language(s) used in plain text. If you are interested in reading more about cld3, the successor of cld2, you can refer to the CLD3 note in this series.
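To make this dependency chain concrete, the sketch below calls pycld2 directly, outside of Polyglot. It is a minimal sketch based on pycld2's documented detect() function; the exact contents of the returned tuples may vary across versions.

# Minimal sketch: calling pycld2 directly, the library Polyglot wraps internally.
# Based on pycld2's documented detect() API; tuple layout may vary by version.
import pycld2 as cld2

is_reliable, bytes_found, details = cld2.detect("Ceci est une phrase en français.")
print(is_reliable)  # True when the detection is considered reliable
print(details[0])   # e.g. ('FRENCH', 'fr', <percent>, <score>)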
Environment Setup
We will use the Polyglot NLP toolkit to test and evaluate automatic language identification on given input texts. First, we will create a conda virtual environment. This assumes that a Python distribution, such as Miniconda, Anaconda, or native Python, is already installed on your computer. If not, you can download and install one from the above links.
Creation of Conda Virtual Environment
You can run the shell commands below in your operating system's terminal, in the given order, without root privileges.
# Create Environment
conda create -n venv_polyglot python=3.9.5 -y

# Configure shell for conda
conda init bash

# Activate Environment
conda activate venv_polyglot
In the above code, we create a conda virtual environment for Python version 3.9.5 and then activate it.
Installation of Packages
The script below installs Polyglot from the PyPI and conda repositories. It also installs pycountry, which is helpful for getting language names from ISO 639 language codes (see the short example after the install commands). The installation of Polyglot is a little tricky because some of its dependencies require native libraries; for that reason, we install those packages from the conda-forge repository, since conda can provide the prebuilt binaries that pip alone cannot.
conda install six=1.16.0 -y

# Polyglot.
conda install -c conda-forge -y pyicu=2.7.4
pip install 'morfessor==2.0.6'
pip install 'polyglot==16.7.4'
conda install -c conda-forge -y pycld2=0.41
pip install 'pycountry==22.1.10'
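After installation, a quick sanity check such as the one below confirms that the imports resolve and shows how pycountry maps an ISO 639-1 code to a language name. This is a minimal sketch; the code "fa" is used here only as an example.

# Quick post-install sanity check: the imports should resolve without errors,
# and pycountry can map an ISO 639-1 code to its language name.
from polyglot.detect import Detector
import pycountry

print(pycountry.languages.get(alpha_2="fa").name)  # Persian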
Python code for Polyglot language detection
There are two classes, LID and PolyGlotBasedLID. LID defines the standard methods essential for any language detection operation. The PolyGlotBasedLID class inherits from LID and overrides the detect function. It is designed this way because the input parameters and logic of the detect function may differ between language identification libraries.
Let us assume that the code below is stored in the lid.py file.
from polyglot.detect import Detector
from polyglot.utils import pretty_list
from pycountry import languages


class LID():
    """Base class defining the standard language identification interface."""

    def __init__(self):
        self.lang = None
        self.score = None
        self.lang_name = None

    def get_lang_code_iso639(self):
        return(self.lang)

    def get_confidence_score(self):
        return(self.score)

    def get_lang_name(self):
        return(self.lang_name)

    def detect(self, input) -> bool:
        pass

    def supported_langs(self):
        pass


class PolyGlotBasedLID(LID):
    """Language identification backed by the Polyglot Detector."""

    def __init__(self):
        super().__init__()
        self.input = None

    def detect(self, input) -> bool:
        self.input = input
        try:
            detector = Detector(self.input)
            # str(detector.language) is a formatted string such as
            # "name: Persian   code: fa   confidence: 99.0 read bytes: ...".
            # Split it on spaces and drop the empty tokens to pick out the fields.
            str_lst = str(detector.language).split(" ")
            if str_lst:
                str_lst = [st for st in str_lst if st != ""]
            self.lang = str_lst[3]                # ISO 639-1 code
            self.score = float(str_lst[5]) / 100  # confidence as a 0-1 fraction
            self.lang_name = str_lst[1]           # human-readable language name
            return(True)
        except Exception as ex:
            template = "An exception of type {0} occurred. Arguments:\n{1!r}"
            message = template.format(type(ex).__name__, ex.args)
            print(message)
            self.status = False
            return(False)

    def supported_langs(self):
        print(pretty_list(Detector.supported_languages()))
Language Detection
Let us assume that the code below is stored in the main.py file and that the lid.py file containing the classes above is in the same directory.
PolyGlotBasedLID is designed to work with pre-trained models.
from lid import PolyGlotBasedLID

script = " را انتقاد مى کند و مى گويد که اعلام نتایج نهایی چهار حوزه انتخاباتی در روز گشايش شورا پرسش برانگيز است.ده بود و در روز افتتاح (ششم ثور) نتایج نهایی ولایات میدان وردک، بغلان، کندز و حوزه کوچی هاى اين جرگه را نيز اعلام کر"

obj = PolyGlotBasedLID()
print(obj.supported_langs())

if obj.detect(script) == True:
    print("LID: PolyGlotBasedLID")
    print(obj.get_confidence_score())
    print(obj.get_lang_code_iso639())
    print(obj.get_lang_name())
Outputs
Confidence Score: 0.99
Language Code: fa
Language Full Name: Persian
We can test texts or scripts in different natural languages using the above code. The complete Python source, which uses the Polyglot NLP toolkit for automatic language identification, is uploaded to the GitHub repository listed in the references.
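For inputs that mix several languages, Polyglot's Detector also exposes a languages attribute listing every detected language with its confidence. The sketch below follows the attribute names used in Polyglot's documentation (name, code, confidence); treat it as a minimal illustration rather than part of the classes above.

# Minimal sketch: inspecting all languages detected in a mixed-language input.
# Attribute names follow Polyglot's documented Detector/Language objects.
from polyglot.detect import Detector

mixed = "China (simplified Chinese: 中国) is a country in East Asia."
for language in Detector(mixed).languages:
    print(language.name, language.code, language.confidence)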
References
- https://github.com/notepub-io/language-identification-in-texts/blob/main/polyglot
- https://www.diffen.com/difference/Linguist_vs_Polyglot
CITE THIS AS:
“Automatic Language Identification in Texts – Polyglot” From NotePub.io – Publish & Share Note! https://notepub.io/notes/artificial-intelligence/speech-and-language-processing/language-identification/lid-polyglot/