Automatic Language Identification in Texts – GCLD3

In the previous note on automatic language identification, we introduced how to detect language using fastText. fastText is a library created by Facebook’s AI Research lab for efficient learning of text representations and classification.

In this note, we introduce another language identification library called Google Compact Language Detector v3 (GCLD3). GCLD3 is designed to run in the Chrome browser and is written in C++. It is based on a neural network model and supports over 100 languages/scripts. It also ships with a Python binding, so GCLD3 functions can be called directly from a Python environment.

Introduction

CLD3 is a neural network model for language identification. It contains the inference code and a trained model. The inference code extracts character ngrams from the input text and computes the fraction of times each of them appears. For example, as shown in the figure below, if the input text is “banana”, the extracted trigrams are “ban”, “ana”, “nan”, and “ana”, so the fraction for “ana” is 2/4. The ngrams are hashed down to an id within a small range, and each id is represented by a dense embedding vector estimated during training.
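The fraction computation described above can be sketched in a few lines of Python. This is only an illustration: the helper names ngram_fractions and hashed_id are hypothetical, and the CRC32-based hash with 1000 buckets is a stand-in chosen for the example, not CLD3's actual hash function or bucket size.

```python
from collections import Counter
import zlib


def ngram_fractions(text, n):
    """Return {ngram: share of all character n-grams in text equal to it}."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    counts = Counter(grams)
    total = len(grams)
    return {gram: count / total for gram, count in counts.items()}


def hashed_id(ngram, num_buckets=1000):
    # Stand-in hash (assumption): CRC32 modulo a bucket count.
    # CLD3's real hash function and id range are not reproduced here.
    return zlib.crc32(ngram.encode("utf-8")) % num_buckets


fractions = ngram_fractions("banana", 3)
print(fractions)  # {'ban': 0.25, 'ana': 0.5, 'nan': 0.25}

# Each distinct n-gram maps to an id that would index an embedding table.
ids = {gram: hashed_id(gram) for gram in fractions}
```

The trigram “ana” occurs twice among the four trigrams of “banana”, which gives the 2/4 mentioned above.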

The model averages the embeddings corresponding to each ngram type according to the fractions, and the averaged embeddings are concatenated to produce the embedding layer. The remaining components of the network are a hidden (Rectified linear) layer and a softmax layer.
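To make the data flow concrete, here is a toy forward pass in plain Python. All numbers (embeddings, weights, biases) and the two-language output are invented for illustration, and the real model uses more ngram types and much larger layers; only the sequence of steps follows the description above.

```python
import math


def weighted_average(embeddings, fractions):
    """Average embedding vectors, weighting each by its n-gram fraction."""
    dim = len(next(iter(embeddings.values())))
    averaged = [0.0] * dim
    for gram, fraction in fractions.items():
        for k, value in enumerate(embeddings[gram]):
            averaged[k] += fraction * value
    return averaged


def dense(vector, weights, biases):
    """One fully connected layer: weights is a list of rows, one per output."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, biases)]


def relu(vector):
    return [max(0.0, x) for x in vector]


def softmax(vector):
    peak = max(vector)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in vector]
    total = sum(exps)
    return [e / total for e in exps]


# Toy inputs for the text "banana" (embedding values are made up).
bigram_fractions = {"ba": 0.2, "an": 0.4, "na": 0.4}
bigram_embeddings = {"ba": [1.0], "an": [0.0], "na": [0.5]}
trigram_fractions = {"ban": 0.25, "ana": 0.5, "nan": 0.25}
trigram_embeddings = {"ban": [1.0, 0.0], "ana": [0.0, 1.0], "nan": [1.0, 1.0]}

# One averaged vector per ngram type, concatenated into the embedding layer.
bigram_average = weighted_average(bigram_embeddings, bigram_fractions)
trigram_average = weighted_average(trigram_embeddings, trigram_fractions)
embedding_layer = bigram_average + trigram_average

# Hidden ReLU layer followed by a softmax over two hypothetical languages.
hidden = relu(dense(embedding_layer,
                    [[0.5, -1.0, 1.0], [1.0, 1.0, -0.5]], [0.0, 0.1]))
probabilities = softmax(dense(hidden, [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]))
print(probabilities)
```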

Figure: Language Identification Using Google Compact Language Detector v3 (GCLD3)

Environment Setup

We will use the gcld3 Python library to test and evaluate automatic language identification on input texts. First, we will create a conda virtual environment. This assumes that a Python distribution is already installed on your computer, such as Miniconda, Anaconda, or a native Python installation. If not, download and install one of these distributions first.

Creation of Conda Virtual Environment

You can run the shell commands below in your operating system’s terminal, in the order shown. Root privileges are not required.

# Create Environment
conda create -n venv_gcld3 python=3.8

# Configure shell for conda
conda init bash

# Activate Environment
conda activate venv_gcld3

The commands above create a conda virtual environment with Python 3.8. In our experience, gcld3 works well with Python 3.8. With Python 3.9 we ran into a problem whose cause we have not investigated.

Installation of Packages

The script below installs gcld3 from the PyPI repository. It also installs pycountry, which is helpful for getting language names from ISO 639 language codes.

pip install gcld3
pip install 'pycountry==22.1.10'
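After installation, a quick way to confirm that both packages are visible to the active environment is to ask Python whether it can locate them. The helper name missing_packages is hypothetical; the check relies only on the standard library.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(["gcld3", "pycountry"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("gcld3 and pycountry are importable")
```

If a package is reported as not installed, check that the venv_gcld3 environment is activated in the terminal where pip was run.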

Python code for gcld3

There are two classes, LID and GCLD3BasedLID. LID defines the standard methods needed for any language detection operation. The GCLD3BasedLID class inherits from LID and overrides the detect function. It is designed this way because the input parameters and logic of the detect function may differ between language identification libraries.

Let us assume that the code below is stored in the lid.py file.

from gcld3 import NNetLanguageIdentifier
from pycountry import languages

class LID():

    def __init__(self):
        self.lang = None 
        self.score = None 
        self.lang_name = None 

    def get_lang_code_iso639(self):
        return(self.lang)
    
    def get_confidence_score(self):
        return(self.score)

    def get_lang_name(self):
        return(self.lang_name)

    def detect(self,input) -> bool:
        pass

class GCLD3BasedLID(LID):

    def __init__(self,input_min_limit=0,input_max_limit=20480):
        super().__init__()
        self.max_limit = input_max_limit
        self.min_limit = input_min_limit
        self.max_top_langs = 2
        self.detector = NNetLanguageIdentifier(min_num_bytes=self.min_limit, max_num_bytes=self.max_limit)

    def detect(self,input):
        return(self.__ndetect(input))

    def __detect(self,input):

        try:
            
            results=self.detector.FindLanguage(input)
            self.lang = results.language
            self.score = results.probability
            self.lang_name = languages.get(alpha_2=self.lang).name
            return(True)
            

        except Exception as ex:
            template = "An exception of type {0} occurred. Arguments:\n{1!r}"
            message = template.format(type(ex).__name__, ex.args)
            print(message)
            return(False)
    
    def __ndetect(self,input):
        # Ask for the top candidates and keep the most probable reliable one.
        top_result = None

        try:

            results=self.detector.FindTopNMostFreqLangs(text=input,num_langs=self.max_top_langs)

            for i in results:
                # Skip candidates that the model flags as unreliable.
                if not i.is_reliable:
                    continue
                if top_result is None or top_result.probability < i.probability:
                    top_result = i

            # No reliable candidate was found.
            if top_result is None:
                return(False)

            self.lang = top_result.language
            self.score = top_result.probability
            # If pycountry has no entry for this code, the lookup fails
            # and the handler below reports it.
            self.lang_name = languages.get(alpha_2=self.lang).name
            return(True)

        except Exception as ex:
            template = "An exception of type {0} occurred. Arguments:\n{1!r}"
            message = template.format(type(ex).__name__, ex.args)
            print(message)
            return(False)
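The selection rule used in __ndetect (keep only reliable candidates, then take the one with the highest probability) can be tried out without loading the model. In this sketch, FakeResult is a hypothetical stand-in that carries only the three fields the method reads, and pick_best is an illustrative helper, not part of gcld3.

```python
from collections import namedtuple

# Hypothetical stand-in for a detection result; only the fields used above.
FakeResult = namedtuple("FakeResult", ["language", "probability", "is_reliable"])


def pick_best(results):
    """Return the reliable result with the highest probability, or None."""
    reliable = [r for r in results if r.is_reliable]
    if not reliable:
        return None
    return max(reliable, key=lambda r: r.probability)


candidates = [
    FakeResult("fa", 0.62, True),
    FakeResult("ar", 0.91, False),  # more probable, but flagged unreliable
    FakeResult("ur", 0.40, True),
]
best = pick_best(candidates)
print(best.language)  # fa
```

Filtering on reliability first means a high-probability but unreliable candidate never wins over a reliable one.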

Language Detection

Let us assume that the code below is stored in the main.py file, next to the lid.py file that contains the classes defined above.

from lid import GCLD3BasedLID

obj = GCLD3BasedLID()
    
script = "این سومین نشست وزیران کشورهای منطقه است که در مورد بحران افغانستان طی کمتر از هفت ماه"
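if obj.detect(script):
    # Print each value with the label shown in the output below.
    print(f"Confidence Score:    {obj.get_confidence_score()}")
    print(f"Language Code:       {obj.get_lang_code_iso639()}")
    print(f"Language full Name:  {obj.get_lang_name()}")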

Outputs

Confidence Score:    1.0
Language Code:       fa
Language full Name:  Persian

The confidence score is the probability that the model assigns to the detected language. It lies between 0 and 1, and higher values indicate a more confident prediction.

We can test texts in other natural languages or scripts using the above code. The complete Python source, which uses the gcld3 library for automatic language identification, is available in the GitHub repository.
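When trying other languages, keep in mind that not every tag CLD3 reports is a plain two-letter code: its list of supported languages includes three-letter codes such as ceb and script-qualified tags such as zh-Latn. A lookup by alpha_2 alone will not find these. The sketch below shows one possible way to build the keyword arguments for pycountry's languages.get; the helper name pycountry_query is hypothetical, and the fallback behaviour is an assumption made for illustration.

```python
def pycountry_query(tag):
    """Map a language tag to keyword arguments for pycountry's languages.get().

    Returns None when the primary subtag is neither two nor three letters.
    """
    primary = tag.split("-")[0].lower()  # drop a script suffix such as "-Latn"
    if len(primary) == 2:
        return {"alpha_2": primary}
    if len(primary) == 3:
        return {"alpha_3": primary}
    return None


print(pycountry_query("fa"))       # {'alpha_2': 'fa'}
print(pycountry_query("zh-Latn"))  # {'alpha_2': 'zh'}
print(pycountry_query("ceb"))      # {'alpha_3': 'ceb'}

# Possible use with pycountry (assumed to be installed):
#   from pycountry import languages
#   query = pycountry_query(tag)
#   entry = languages.get(**query) if query else None
#   name = entry.name if entry is not None else tag
```

Falling back to the raw tag when no entry is found keeps a successful detection from being reported as a failure only because the name lookup came back empty.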

CITE THIS AS:

“Automatic Language Identification in Texts – GCLD3”  From NotePub.io – Publish & Share Note! https://notepub.io/notes/artificial-intelligence/speech-and-language-processing/language-identification/lid-gcld3/
