Automatic Language Identification in Texts – Langdetect

In the note series on automatic language identification, we previously introduced how to detect language using the gcld3 library. That library is designed to run in the Chrome browser, is written in C++, is based on a neural network model, and supports over 100 languages/scripts.

In this note, we introduce another language identification library called LangDetect. It uses a Naive Bayes classifier over character n-grams to detect the language. The classifier was trained on Wikipedia data covering 49 languages. It was initially implemented in Java; a Python port was released in 2014.

Introduction

Language detection is one of the crucial tasks in NLP, and it becomes more challenging when the same script is used to write different languages. In those cases, matching each word against a dictionary of each language does not work. Moreover, it requires additional knowledge of scripts (or Unicode) and languages. For example:

  • Dari, Kashmiri, etc., are different languages, but they are written in the Perso-Arabic script.
  • English, French, German, and Italian are different languages, but they are written in the Roman (Latin) script.
  • Hindi, Marathi, Nepali, etc., are different languages, but they are written in the Devanagari script.
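
To make the shared-script point concrete, here is a minimal sketch that uses only Python's standard unicodedata module. The helper name dominant_script and the sample sentences are illustrative assumptions, and taking the first word of a character's Unicode name is a rough heuristic rather than the official Unicode Script property.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Return the most common script label among the letters of `text`.

    The label is the first word of each letter's Unicode name
    (e.g. "DEVANAGARI", "LATIN", "ARABIC"). This is a heuristic.
    """
    counts = Counter(
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()  # skip digits, spaces, punctuation, and combining marks
    )
    return counts.most_common(1)[0][0] if counts else "UNKNOWN"


# Illustrative sentences: two languages per script.
samples = {
    "Hindi": "मेरा नाम राम है",
    "Marathi": "माझे नाव राम आहे",
    "English": "My name is Ram",
    "French": "Je m'appelle Ram",
}
for language, sentence in samples.items():
    print(language, dominant_script(sentence))
```

Hindi and Marathi both come out as Devanagari, and English and French both come out as Latin, so the script alone cannot separate the languages within each pair.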

At a very high level, LangDetect uses a Naive Bayes classifier with character n-grams. During training, it estimates the prior and likelihood probabilities of character n-grams in different languages; at prediction time, it computes the posterior probability of each language.
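
The idea can be sketched with a toy classifier. This is not LangDetect's actual implementation: the class name TinyNgramNB, the bigram size, the add-one smoothing, and the four training sentences are all assumptions chosen to keep the example small.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of `text` (lowercased, space-padded)."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class TinyNgramNB:
    """Toy Naive Bayes language classifier over character n-grams."""

    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = {}       # language -> Counter of n-grams
        self.doc_counts = Counter()  # language -> number of training texts
        self.vocab = set()           # every n-gram seen during training

    def fit(self, samples):
        for lang, text in samples:
            grams = char_ngrams(text, self.n)
            self.ngram_counts.setdefault(lang, Counter()).update(grams)
            self.doc_counts[lang] += 1
            self.vocab.update(grams)

    def predict_proba(self, text):
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for lang, counts in self.ngram_counts.items():
            # Prior: share of training texts written in this language.
            log_p = math.log(self.doc_counts[lang] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for gram in char_ngrams(text, self.n):
                # Likelihood of each n-gram, with add-one smoothing.
                log_p += math.log((counts[gram] + 1) / denom)
            log_scores[lang] = log_p
        # Posterior: normalise the scores so they sum to 1.
        peak = max(log_scores.values())
        weights = {lang: math.exp(s - peak) for lang, s in log_scores.items()}
        norm = sum(weights.values())
        return {lang: w / norm for lang, w in weights.items()}


training = [
    ("en", "the quick brown fox jumps over the lazy dog"),
    ("en", "this is a short sentence written in english"),
    ("fr", "le renard brun saute par dessus le chien paresseux"),
    ("fr", "ceci est une courte phrase écrite en français"),
]
model = TinyNgramNB(n=2)
model.fit(training)
probs = model.predict_proba("the dog is in the house")
best = max(probs, key=probs.get)
print(best, round(probs[best], 3))
```

Because the features are characters rather than whole words, such a model can still score words it has never seen, which is why dictionary lookup is not needed.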

Environment Setup

We will use the langdetect Python library to test and evaluate automatic language identification on given input texts. First, we will create a conda virtual environment. We assume that a Python distribution is already installed on your computer. This could be Miniconda, Anaconda, or a native Python installation. If not, download and install one of these distributions before continuing.

Creation of Conda Virtual Environment

You can run the shell commands below in your operating system’s terminal, in the same sequence, without root privileges.

# Create Environment
conda create -n venv_langdetect python=3.8

# Configure shell for conda
conda init bash

# Activate Environment
conda activate venv_langdetect

In the above code, we create a conda virtual environment for Python 3.8 and then activate it.

Installation of Packages

The script below installs langdetect from the conda-forge channel. It also installs pycountry, which is helpful for getting language names from ISO 639 language codes.

conda install -c conda-forge langdetect
pip install 'pycountry==22.1.10'

Python code for langdetect

There are two classes, LID and LangDetectBasedLID. LID defines the standard methods that any language detection backend needs. The LangDetectBasedLID class inherits from LID and overrides the detect function. It is designed this way because the input parameters and logic of the detect function may differ from one language identification library to another.

Let us assume that the code below is stored in a file named lid.py.

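from langdetect import detect_langs
from pycountry import languages


class LID:
    """Common interface for language identification backends."""

    def __init__(self):
        self.lang = None       # language code, e.g. "hi"
        self.score = None      # probability of the top candidate (float)
        self.lang_name = None  # full language name, e.g. "Hindi"

    def get_lang_code_iso639(self):
        return self.lang

    def get_confidence_score(self):
        return self.score

    def get_lang_name(self):
        return self.lang_name

    def detect(self, text) -> bool:
        # Subclasses override this and return True on success.
        raise NotImplementedError


class LangDetectBasedLID(LID):
    """Language identification backed by the langdetect library."""

    def detect(self, text) -> bool:
        try:
            # detect_langs returns candidates sorted by descending probability.
            results = detect_langs(text)
            if not results:
                return False

            # Read the attributes directly instead of parsing the string form.
            best = results[0]
            self.lang = best.lang
            self.score = best.prob

            # pycountry may not know codes such as "zh-cn"; fall back to the code.
            language = languages.get(alpha_2=self.lang)
            self.lang_name = language.name if language else self.lang
            return True

        except Exception as ex:
            template = "An exception of type {0} occurred. Arguments:\n{1!r}"
            message = template.format(type(ex).__name__, ex.args)
            print(message)
            return False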
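
To illustrate why the base class is useful, the following sketch plugs a second, purely hypothetical backend into the same interface. StopwordLID, its tiny stopword lists, and the report helper are assumptions made for illustration; the LID class is redefined in minimal form only so that the snippet runs on its own.

```python
class LID:
    """Minimal stand-in for the LID base class in lid.py."""

    def __init__(self):
        self.lang = None
        self.score = None

    def detect(self, text) -> bool:
        raise NotImplementedError


class StopwordLID(LID):
    """Hypothetical backend: scores languages by counting known stopwords."""

    STOPWORDS = {
        "en": {"the", "is", "and", "of", "in"},
        "fr": {"le", "la", "est", "et", "de"},
    }

    def detect(self, text) -> bool:
        words = text.lower().split()
        hits = {
            lang: sum(word in stopwords for word in words)
            for lang, stopwords in self.STOPWORDS.items()
        }
        total = sum(hits.values())
        if total == 0:
            return False  # no evidence for any language
        self.lang = max(hits, key=hits.get)
        self.score = hits[self.lang] / total
        return True


def report(detector: LID, text: str) -> str:
    """Caller code depends only on the shared LID interface."""
    if detector.detect(text):
        return f"{detector.lang} ({detector.score:.2f})"
    return "undetected"


print(report(StopwordLID(), "the cat is in the garden"))
```

The report function never needs to know which library sits behind detect, so swapping backends only means passing a different LID subclass.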

Language Detection

Let us assume that the code below is stored in main.py, alongside the file lid.py, which contains the classes defined above.

from lid import LangDetectBasedLID

script = "भगवंत मान ने पंजाब के 17वें मुख्यमंत्री पद की शपथ ली.  पंजाब विधानसभा चुनाव में आम आदमी पार्टी को प्रचंड जीत मिली है. भगवंत मान का शपथ ग्रहण भगत सिंह के गांव खटकर कलां में हो रहा है. बाकी मंत्रियों का शपथग्रहण बाद में होगा. "
obj = LangDetectBasedLID()
if obj.detect(script):
    # Print each field with a label.
    print("Confidence Score:   ", obj.get_confidence_score())
    print("Language Code:      ", obj.get_lang_code_iso639())
    print("Language full Name: ", obj.get_lang_name())

Outputs

Confidence Score:    0.9999966848043597
Language Code:       hi
Language full Name:  Hindi

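The confidence score is the probability that langdetect assigns to the detected language, not a measured accuracy. Its value lies between 0 and 1, and higher values indicate greater confidence in the prediction. Because langdetect's algorithm is non-deterministic, the score may vary slightly between runs on the same text.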
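
One way to use the score is to accept a prediction only above a cut-off. The sketch below is an assumption-laden example: the helper names pick_language and detect_with_threshold, the 0.9 threshold, and the sample probabilities are illustrative, while DetectorFactory.seed is the setting langdetect documents for reproducible results.

```python
def pick_language(candidates, threshold=0.9):
    """Return the top (lang, prob) pair if its probability meets the threshold.

    `candidates` is an iterable of (language_code, probability) pairs.
    Returns None when there is no candidate or the best one is too uncertain.
    """
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    if ranked and ranked[0][1] >= threshold:
        return ranked[0]
    return None


def detect_with_threshold(text, threshold=0.9, seed=0):
    """Run langdetect with a fixed seed and apply the threshold."""
    # Imported here so pick_language stays usable without langdetect installed.
    from langdetect import DetectorFactory, detect_langs

    # Fixing the seed makes repeated runs on the same text give the same result.
    DetectorFactory.seed = seed
    candidates = [(c.lang, c.prob) for c in detect_langs(text)]
    return pick_language(candidates, threshold)


# Made-up probabilities, only to show the thresholding behaviour.
print(pick_language([("hi", 0.71), ("mr", 0.29)]))  # best candidate is below 0.9
print(pick_language([("hi", 0.99), ("mr", 0.01)]))  # best candidate is accepted
```

A suitable threshold depends on the application; short or mixed-language texts tend to produce lower scores.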

We can test texts in different natural languages or scripts using the above code. The complete Python source, which uses the langdetect library for automatic language identification, is uploaded to the GitHub repository.

To execute the above code directly in a Jupyter notebook, one can follow the Kaggle Notebook.

CITE THIS AS:

“Automatic Language Identification in Texts – LangDetect”  From NotePub.io – Publish & Share Note! https://notepub.io/notes/artificial-intelligence/speech-and-language-processing/language-identification/lid-langdetect/
