In an earlier note of this series on automatic language identification, we introduced language detection with the langid tool, which uses a naive Bayes classifier with a multinomial event model over a mixture of character n-grams and is trained on 97 languages. It also provides tools for model building, training, tokenization, and so on, which are helpful for building a model for a specific language. In addition, it is fast, insensitive to domain-specific features, and easy to install and use.
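As a quick refresher on that earlier approach, the sketch below shows minimal langid usage (assuming the langid package is installed from PyPI; the score returned by classify() is langid's internal confidence value, not a calibrated probability):

# Minimal recap of the langid-based approach from the earlier note.
# Assumes: pip install langid
import langid

# classify() returns an ISO 639-1 code and a confidence score.
lang, score = langid.classify("This is a short English sentence.")
print(lang, score)  # e.g. en and an unnormalized score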
Introduction
In this note, we introduce another language identification library called Polyglot. In dictionary terms, a polyglot is a person who can speak, read, or write in several languages.
Polyglot is a natural language pipeline that supports massive multilingual applications. It supports the following features:
- Tokenization (165 Languages)
- Language detection (196 Languages)
- Named Entity Recognition (40 Languages)
- Part of Speech Tagging (16 Languages)
- Sentiment Analysis (136 Languages)
- Word Embeddings (137 Languages)
- Morphological analysis (135 Languages)
- Transliteration (69 Languages)
However, our interest is in language detection. Internally, Polyglot uses the pycld2 library, which in turn depends on the cld2 library, to detect the language(s) used in plain text. If you are interested in reading more about cld3, the successor of cld2, you can refer to the CLD3 note in this series.
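To make this dependency chain concrete, the sketch below calls pycld2 directly, outside of Polyglot. It is a minimal sketch based on pycld2's documented detect() function; the exact contents of the returned tuples may vary across versions.

# Minimal sketch: calling pycld2 directly, the library Polyglot wraps internally.
# Based on pycld2's documented detect() API; tuple layout may vary by version.
import pycld2 as cld2

is_reliable, bytes_found, details = cld2.detect("Ceci est une phrase en français.")
print(is_reliable)  # True when the detection is considered reliable
print(details[0])   # e.g. ('FRENCH', 'fr', <percent>, <score>)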
Environment Setup
We will use the Polyglot NLP toolkit to test and evaluate automatic language identification on given input texts. First, we will create a conda virtual environment. This assumes that a Python distribution, such as Miniconda, Anaconda, or native Python, is already installed on your computer. If not, you can download and install one from the above links.
Creation of Conda Virtual Environment
You can run the shell commands below in your operating system's terminal, in the given order, without root privileges.
# Create Environment
conda create -n venv_polyglot python=3.9.5 -y

# Configure shell for conda
conda init bash

# Activate Environment
conda activate venv_polyglot
In the above code, we create a conda virtual environment for Python version 3.9.5 and then activate it.
Installation of Packages
The script below installs Polyglot from the PyPI and conda repositories. It also installs pycountry, which is helpful for getting language names from ISO 639 language codes (see the short example after the install commands). The installation of Polyglot is a little tricky because some of its dependencies require native libraries; for that reason, we install those packages from the conda-forge repository, since conda can provide the prebuilt binaries that pip alone cannot.
conda install six=1.16.0 -y

# Polyglot.
conda install -c conda-forge -y pyicu=2.7.4
pip install 'morfessor==2.0.6'
pip install 'polyglot==16.7.4'
conda install -c conda-forge -y pycld2=0.41
pip install 'pycountry==22.1.10'
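After installation, a quick sanity check such as the one below confirms that the imports resolve and shows how pycountry maps an ISO 639-1 code to a language name. This is a minimal sketch; the code "fa" is used here only as an example.

# Quick post-install sanity check: the imports should resolve without errors,
# and pycountry can map an ISO 639-1 code to its language name.
from polyglot.detect import Detector
import pycountry

print(pycountry.languages.get(alpha_2="fa").name)  # Persian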
Python code for Polyglot language detection
There are two classes, LID and PolyGlotBasedLID. LID defines the standard methods essential for any language detection operation. The PolyGlotBasedLID class inherits from LID and overrides the detect function. It is designed this way because the input parameters and logic of the detect function may differ between language identification libraries.
Let us assume that the code below is stored in the lid.py file.
from polyglot.detect import Detector
from polyglot.utils import pretty_list
from pycountry import languages


class LID():
    """Base class defining the standard language identification interface."""

    def __init__(self):
        self.lang = None
        self.score = None
        self.lang_name = None

    def get_lang_code_iso639(self):
        return(self.lang)

    def get_confidence_score(self):
        return(self.score)

    def get_lang_name(self):
        return(self.lang_name)

    def detect(self, input) -> bool:
        pass

    def supported_langs(self):
        pass


class PolyGlotBasedLID(LID):
    """Language identification backed by the Polyglot Detector."""

    def __init__(self):
        super().__init__()
        self.input = None

    def detect(self, input) -> bool:
        self.input = input
        try:
            detector = Detector(self.input)
            # str(detector.language) is a formatted string such as
            # "name: Persian   code: fa   confidence: 99.0 read bytes: ...".
            # Split it on spaces and drop the empty tokens to pick out the fields.
            str_lst = str(detector.language).split(" ")
            if str_lst:
                str_lst = [st for st in str_lst if st != ""]
            self.lang = str_lst[3]                # ISO 639-1 code
            self.score = float(str_lst[5]) / 100  # confidence as a 0-1 fraction
            self.lang_name = str_lst[1]           # human-readable language name
            return(True)
        except Exception as ex:
            template = "An exception of type {0} occurred. Arguments:\n{1!r}"
            message = template.format(type(ex).__name__, ex.args)
            print(message)
            self.status = False
            return(False)

    def supported_langs(self):
        print(pretty_list(Detector.supported_languages()))
Language Detection
Let us assume that the code below is stored in the main.py file and that the lid.py file containing the classes above is in the same directory.
PolyGlotBasedLID is designed to work with pre-trained models.
from lid import PolyGlotBasedLID

script = " را انتقاد مى کند و مى گويد که اعلام نتایج نهایی چهار حوزه انتخاباتی در روز گشايش شورا پرسش برانگيز است.ده بود و در روز افتتاح (ششم ثور) نتایج نهایی ولایات میدان وردک، بغلان، کندز و حوزه کوچی هاى اين جرگه را نيز اعلام کر"

obj = PolyGlotBasedLID()
print(obj.supported_langs())

if obj.detect(script) == True:
    print("LID: PolyGlotBasedLID")
    print(obj.get_confidence_score())
    print(obj.get_lang_code_iso639())
    print(obj.get_lang_name())
Outputs
Confidence Score: 0.99
Language Code: fa
Language Full Name: Persian
We can test texts or scripts in different natural languages using the above code. The complete Python source, which uses the Polyglot NLP toolkit for automatic language identification, is uploaded to the GitHub repository listed in the references.
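For inputs that mix several languages, Polyglot's Detector also exposes a languages attribute listing every detected language with its confidence. The sketch below follows the attribute names used in Polyglot's documentation (name, code, confidence); treat it as a minimal illustration rather than part of the classes above.

# Minimal sketch: inspecting all languages detected in a mixed-language input.
# Attribute names follow Polyglot's documented Detector/Language objects.
from polyglot.detect import Detector

mixed = "China (simplified Chinese: 中国) is a country in East Asia."
for language in Detector(mixed).languages:
    print(language.name, language.code, language.confidence)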
References
- https://github.com/notepub-io/language-identification-in-texts/blob/main/polyglot
- https://www.diffen.com/difference/Linguist_vs_Polyglot
CITE THIS AS:
“Automatic Language Identification in Texts – Polyglot” From NotePub.io – Publish & Share Note! https://notepub.io/notes/artificial-intelligence/speech-and-language-processing/language-identification/lid-polyglot/