Automatic Language Identification in Texts – LangId

In the earlier note of this series on automatic language identification, we showed how to detect languages with the sparknlp library, which uses pre-trained deep learning models built with CNN architectures in TensorFlow/Keras. Its published pre-trained models can currently detect 375 languages, which is significantly more than any other open-source library.

Introduction

In this note, we introduce another language identification tool called langid. It uses a naive Bayes classifier with a multinomial event model over a mixture of character n-grams, and it is trained on 97 languages. It provides additional tools for model building, training, tokenization, etc., that help when building a model for a specific language. In addition, it is fast, insensitive to domain-specific features, and easy to install and use.
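To make the idea of a multinomial naive Bayes classifier over character n-grams concrete, here is a minimal sketch using only the standard library. It is not langid's implementation: the class name ToyNgramNB, the add-one smoothing, and the four training sentences are assumptions chosen for illustration, and langid additionally performs feature selection that is omitted here.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_max=3):
    """Return all character n-grams of length 1..n_max (a mixture of orders)."""
    return [text[i:i + n]
            for n in range(1, n_max + 1)
            for i in range(len(text) - n + 1)]


class ToyNgramNB:
    """Toy multinomial naive Bayes over character n-grams (illustration only)."""

    def fit(self, samples):
        # samples: iterable of (text, language_code) pairs
        self.counts = defaultdict(Counter)  # language -> n-gram counts
        self.docs = Counter()               # language -> number of documents
        for text, lang in samples:
            self.counts[lang].update(char_ngrams(text.lower()))
            self.docs[lang] += 1
        self.vocab = {g for c in self.counts.values() for g in c}
        return self

    def log_scores(self, text):
        # Only n-grams seen during training contribute to the score.
        grams = [g for g in char_ngrams(text.lower()) if g in self.vocab]
        total_docs = sum(self.docs.values())
        scores = {}
        for lang, counts in self.counts.items():
            denom = sum(counts.values()) + len(self.vocab)  # add-one smoothing
            score = math.log(self.docs[lang] / total_docs)  # log prior
            for g in grams:
                score += math.log((counts[g] + 1) / denom)  # log likelihood
            scores[lang] = score
        return scores

    def classify(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Tiny made-up training set; a real model needs far more data per language.
train = [
    ("the quick brown fox jumps over the lazy dog", "en"),
    ("this is a short sentence written in english", "en"),
    ("der schnelle braune fuchs springt über den faulen hund", "de"),
    ("dies ist ein kurzer satz auf deutsch geschrieben", "de"),
]
clf = ToyNgramNB().fit(train)
print(clf.classify("the dog is in the house"))
print(clf.classify("der hund ist in dem haus"))
```

Each language is scored by summing the log prior and the log likelihood of every n-gram in the input, and the language with the highest total wins.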

According to its specification, langid follows these design principles.

  • Fast
  • Pre-trained over a large number of languages (currently 97)
  • Not sensitive to domain-specific features (e.g. HTML/XML markup)
  • Single-file implementation in Python with minimal dependencies
  • Deployable as a web service

It is also implemented in Java and C. The C implementation is reported to be 20x faster than the Python implementation, so anyone who needs high speed can use the C binary.

It is based on the following published research.

  • Lui, Marco and Timothy Baldwin (2011) Cross-domain Feature Selection for Language Identification, In Proceedings of the Fifth International Joint Conference on Natural Language Processing (IJCNLP 2011), Chiang Mai, Thailand, pp. 553-561. Available from http://www.aclweb.org/anthology/I11-1062
  • Lui, Marco and Timothy Baldwin (2012) langid.py: An Off-the-shelf Language Identification Tool, In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL 2012), Demo Session, Jeju, Republic of Korea. Available from http://www.aclweb.org/anthology/P12-3005
  • Kenneth Heafield and Rohan Kshirsagar and Santiago Barona (2015) Language Identification and Modeling in Specialized Hardware, In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers). Available from http://aclweb.org/anthology/P15-2063

The training data was drawn from 5 different sources:

  • JRC-Acquis
  • ClueWeb 09
  • Wikipedia
  • Reuters RCV2
  • Debian i18n

To read more about the langid language identification tool, please visit the project site. The project is also useful for anyone who wants to learn how to clean data, tokenize, parse, and so on. It even includes an examples directory that shows how to prepare data for training.

Environment Setup

We will use the langid Python library to test and evaluate automatic language identification on given input texts. First, we create a conda virtual environment. This assumes that a Python distribution, such as Miniconda, Anaconda, or native Python, is already installed on your computer. If not, download and install one of these distributions first.

Creation of Conda Virtual Environment

Run the shell commands below in your operating system’s terminal, in the order shown. Root privileges are not required.

# Create Environment
conda create -n venv_langid python=3.9.5 -y

# Configure shell for conda
conda init bash

# Activate Environment
conda activate venv_langid

The code above creates a conda virtual environment with Python 3.9.5 and then activates it.

Installation of Packages

The script below installs langid from the PyPI repository. It also installs pycountry, which helps to look up language names from ISO 639 language codes.

pip install langid
pip install 'pycountry==22.1.10'

Python code for LangId language detection

There are two classes, LID and LangIdBasedLID. LID defines the standard methods that any natural language detection backend needs. LangIdBasedLID inherits from LID and overrides the detect function. The design is split this way because the input parameters and logic of the detect function may differ between language identification libraries.
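To illustrate why detect is the method each backend overrides, the following minimal sketch plugs a different, deliberately naive backend into the same kind of base class. BaseLID, StopwordLID, and the tiny stopword lists are hypothetical names and data invented for this example; they are not part of langid or of the code in this note.

```python
from abc import ABC, abstractmethod


class BaseLID(ABC):
    """Shared result fields; each backend supplies its own detect()."""

    def __init__(self):
        self.lang = None
        self.score = None

    @abstractmethod
    def detect(self, text) -> bool:
        """Populate self.lang and self.score; return True on success."""


class StopwordLID(BaseLID):
    """Hypothetical toy backend: counts hits against tiny stopword lists."""

    STOPWORDS = {
        "en": {"the", "is", "and", "of", "in"},
        "de": {"der", "die", "das", "ist", "und"},
    }

    def detect(self, text) -> bool:
        words = text.lower().split()
        hits = {lang: sum(w in stop for w in words)
                for lang, stop in self.STOPWORDS.items()}
        best = max(hits, key=hits.get)
        if hits[best] == 0:
            return False  # no evidence for any language
        self.lang = best
        self.score = hits[best] / len(words)  # fraction of words that matched
        return True


detector = StopwordLID()
if detector.detect("der Hund ist müde und schläft"):
    print(detector.lang, detector.score)
```

Callers only depend on detect and the result fields, so swapping one backend for another does not change the calling code.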

Let us assume that the code below is stored in the lid.py file.

from langid.langid import LanguageIdentifier, model
from pycountry import languages


class LID:
    """Base class holding the result fields shared by all detectors."""

    def __init__(self):
        self.lang = None       # ISO 639-1 language code, e.g. "en"
        self.score = None      # confidence score reported by the backend
        self.lang_name = None  # full language name, e.g. "English"

    def get_lang_code_iso639(self):
        return self.lang

    def get_confidence_score(self):
        return self.score

    def get_lang_name(self):
        return self.lang_name

    def detect(self, text) -> bool:
        # Subclasses override this and return True on success.
        raise NotImplementedError


class LangIdBasedLID(LID):
    """Language detector backed by langid's pre-trained model."""

    def __init__(self):
        super().__init__()
        # norm_probs=True rescales the reported score to the 0-1 range.
        self.identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)

    def detect(self, text) -> bool:
        try:
            # classify() returns a (language_code, score) tuple.
            lang, score = self.identifier.classify(text)
            # pycountry may not know the code; fall back to the code itself.
            match = languages.get(alpha_2=lang)
            self.lang = lang
            self.score = score
            self.lang_name = match.name if match is not None else lang
            return True
        except Exception as ex:
            template = "An exception of type {0} occurred. Arguments:\n{1!r}"
            print(template.format(type(ex).__name__, ex.args))
            return False
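The norm_probs=True argument passed to from_modelstring asks langid to report a normalized score between 0 and 1. Without it, classify returns an unnormalized log-probability. The sketch below shows one common way to do such a normalization: a softmax over per-language log scores. The function name normalize_log_scores and the three log scores are made up for illustration and are not langid output.

```python
import math


def normalize_log_scores(log_scores):
    """Convert per-language log scores into probabilities that sum to 1."""
    peak = max(log_scores.values())  # subtract the max for numerical stability
    exps = {lang: math.exp(s - peak) for lang, s in log_scores.items()}
    total = sum(exps.values())
    return {lang: e / total for lang, e in exps.items()}


# Made-up log scores for three candidate languages (not real langid output).
raw = {"en": -54.4, "de": -71.9, "fr": -80.2}
probs = normalize_log_scores(raw)
best = max(probs, key=probs.get)
print(best, round(probs[best], 6))
```

Because the gap between the best and second-best log score is usually large for longer texts, the normalized score of the winning language is often very close to 1.0.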

Language Detection

Let us assume that the below code is stored in the main.py file, and there is another file, lid.py, which contains all the classes.

LangIdBasedLID is designed to work with pre-trained models.

from lid import LangIdBasedLID

script = "It is designed in such a way because it may be entirely possible that the detect function input parameters and logics may be different for different language identification libraries. "
obj = LangIdBasedLID()
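if obj.detect(script):
    # Print each result with the label shown in the output below.
    print("Confidence Score:   ", obj.get_confidence_score())
    print("Language Code:      ", obj.get_lang_code_iso639())
    print("Language full Name: ", obj.get_lang_name())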

Outputs

Confidence Score:    1.0
Language Code:       en
Language full Name:  English

We can test texts in other natural languages with the same code. The complete Python source, which uses the langid library for automatic language identification, is available in the GitHub repository.
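As a starting point for trying several texts at once, here is a minimal sketch. The helper detect_many and its classify parameter are hypothetical names introduced for this example. By default it falls back to langid's module-level classify function, which returns a (language code, score) pair. Note that the module-level function reports unnormalized scores unless normalization is enabled, so the numbers will not look like the 1.0 shown above.

```python
def detect_many(texts, classify=None):
    """Return a list of (text, lang_code, score) tuples.

    `classify` is any callable mapping a string to (lang_code, score).
    When omitted, langid's module-level classify() is used.
    """
    if classify is None:
        # Imported lazily so the helper can be used with other backends too.
        from langid import classify
    return [(text, *classify(text)) for text in texts]


if __name__ == "__main__":
    samples = [
        "This is a short English sentence.",
        "Dies ist ein kurzer deutscher Satz.",
        "Ceci est une courte phrase en français.",
    ]
    try:
        for text, code, score in detect_many(samples):
            print(f"{code}\t{score:.2f}\t{text}")
    except ImportError:
        print("langid is not installed; run `pip install langid` first")
```

Passing the classifier as an argument keeps the helper independent of any single library, in the same spirit as the LID base class.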

CITE THIS AS:

“Automatic Language Identification in Texts – LangId”  From NotePub.io – Publish & Share Note! https://notepub.io/notes/artificial-intelligence/speech-and-language-processing/language-identification/lid-langid/
