Automatic Language Identification in Texts – Fasttext

Language detection is vital in Natural Language Processing (NLP) because many NLP tasks are language-dependent. However, finding a language detector that handles most natural languages, short texts, and multilingual texts well is difficult. According to our studies, the Fasttext library performs well compared to other automatic language identification libraries such as gcld3, langdetect, langid, nltk_textcat, polyglot, spacy_langdetect, and sparknlp_detect.

Introduction

Fasttext is a library created by Facebook’s AI Research lab for efficient learning of text representations and text classification. For the underlying technical concepts, one may refer to the papers published by its authors.

It is one of the most optimized libraries for text representation and classification because it was built with production concerns in mind, such as time complexity and storage capacity. Its authors also provide a pre-trained language identification model that supports 176 natural languages. The model was trained on data drawn from Wikipedia, Tatoeba, and SETimes.

Environment Setup

We will use the Fasttext Python library to test and evaluate automatic language identification on given input texts. First, we will create a conda virtual environment. This assumes that a Python distribution is already installed on your computer, such as Miniconda, Anaconda, or a native Python installation. If not, download and install one of these distributions before continuing.

Creation of Conda Virtual Environment

Run the shell commands below in your operating system’s terminal, in the order shown. Root privileges are not required.

# Create Environment
conda create -n venv_fasttext python=3.9.5

# Configure shell for conda
conda init bash

# Activate Environment
conda activate venv_fasttext

The commands above create a conda virtual environment with Python 3.9.5, which was one of the latest stable Python versions at the time of writing.
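
To confirm that the environment is active before installing anything, a quick check from Python can help. The following is a minimal sketch, not part of the original setup. It assumes that conda sets the CONDA_DEFAULT_ENV environment variable on activation, and the helper name active_conda_env is hypothetical.

```python
import os
import sys


def active_conda_env():
    """Return the name of the active conda environment, or None if not set."""
    # Assumption: `conda activate` exports CONDA_DEFAULT_ENV in the shell.
    return os.environ.get("CONDA_DEFAULT_ENV")


print("Active conda environment:", active_conda_env())
print("Python version:", ".".join(map(str, sys.version_info[:3])))
```

If the first line does not show venv_fasttext, the activation step probably did not take effect in the current shell.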

Installation of Packages

The script below installs Fasttext version 0.9.2 from the conda-forge repository. It also installs pycountry, which is helpful for getting language names from ISO 639 language codes.

conda install -c conda-forge -y fasttext=0.9.2
pip install 'pycountry==22.1.10'
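
After installation, it may be worth checking that the pinned versions were actually picked up. The sketch below uses only the standard library. It assumes both packages register their metadata under the names fasttext and pycountry, and the helper installed_version is a hypothetical name introduced for illustration.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Versions pinned in the install commands above
expected = {"fasttext": "0.9.2", "pycountry": "22.1.10"}

for name, wanted in expected.items():
    found = installed_version(name)
    status = "OK" if found == wanted else "check"
    print(f"{name}: expected {wanted}, found {found} [{status}]")
```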

Python Code for Fasttext

There are two classes, LID and FastTextBasedLID. LID defines the standard methods that any language detection implementation needs. FastTextBasedLID inherits from LID and overrides the detect function. The design is split this way because the input parameters and logic of the detect function may differ from one language identification library to another.

Let us assume that the code below is stored in a file named lid.py.

import fasttext
from pycountry import languages

class LID():

    def __init__(self):
        self.lang = None 
        self.score = None 
        self.lang_name = None 

    def get_lang_code_iso639(self):
        return(self.lang)
    
    def get_confidence_score(self):
        return(self.score)

    def get_lang_name(self):
        return(self.lang_name)

    def detect(self,input) -> bool:
        pass

class FastTextBasedLID(LID):

    def __init__(self,pretrained_model_path):
        
        super().__init__()
        fasttext.FastText.eprint = lambda x: None

        self.pretrained_model = pretrained_model_path
        try:
            self.model = fasttext.load_model(self.pretrained_model)

        except Exception as ex:
            template = "An exception of type {0} occurred. Arguments:\n{1!r}"
            message = template.format(type(ex).__name__, ex.args)
            print(message)
            self.model = None

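    def detect(self,input) -> bool:

        # fastText predicts one line at a time and rejects embedded newlines
        self.input = input.replace("\n", " ")
        try:
            if self.model:
                predictions = self.model.predict(self.input,k=1)
                if predictions:
                    self.lang = predictions[0][0].split("__label__")[1]
                    self.score = predictions[1][0]
                    # The model emits both two-letter and three-letter codes
                    if len(self.lang) == 2:
                        language = languages.get(alpha_2=self.lang)
                    else:
                        language = languages.get(alpha_3=self.lang)
                    # Some codes may be unknown to pycountry
                    self.lang_name = language.name if language else None
                    return(True)
        except Exception as ex:
            template = "An exception of type {0} occurred. Arguments:\n{1!r}"
            message = template.format(type(ex).__name__, ex.args)
            print(message)
            return(False)
        
        return(False)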
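
The reason for separating the base class from the library-specific class becomes clearer with a second backend. The sketch below is illustrative only: BaseLID is a hypothetical abstract variant of the LID class, and StopwordLID is a toy detector invented here to show how another implementation could plug into the same interface. It is not a realistic language identifier.

```python
from abc import ABC, abstractmethod


class BaseLID(ABC):
    """Hypothetical abstract variant of the article's LID base class."""

    def __init__(self):
        self.lang = None
        self.score = None

    @abstractmethod
    def detect(self, text: str) -> bool:
        """Populate self.lang and self.score; return True on success."""


class StopwordLID(BaseLID):
    """Toy backend for illustration only: counts a few common stopwords."""

    STOPWORDS = {
        "en": {"the", "and", "is", "of", "to"},
        "fr": {"le", "la", "et", "est", "de"},
        "de": {"der", "die", "und", "ist", "das"},
    }

    def detect(self, text: str) -> bool:
        words = text.lower().split()
        if not words:
            return False
        # Count how many words match each language's stopword set
        hits = {
            code: sum(word in stopwords for word in words)
            for code, stopwords in self.STOPWORDS.items()
        }
        best = max(hits, key=hits.get)
        if hits[best] == 0:
            return False
        self.lang = best
        # Fraction of words that matched; a crude stand-in for a confidence
        self.score = hits[best] / len(words)
        return True


detector = StopwordLID()
if detector.detect("the cat is on the roof of the house"):
    print(detector.lang, round(detector.score, 2))
```

Because calling code only relies on detect and the stored attributes, swapping one backend for another does not require changes elsewhere.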

Language Detection

Let us assume that the code below is stored in the main.py file, alongside lid.py, which contains the classes defined above.

Suppose there is also a directory called models that contains the pre-trained language identification model, lid.176.bin, provided by the Fasttext library.

from lid import FastTextBasedLID

model="./models/lid.176.bin"
script = "This note series is designed to provide the insides of using open-source language identifiers to predict the natural language of a given input text."

obj = FastTextBasedLID(model)
if obj.detect(script):
    print("Confidence Score:", obj.get_confidence_score())
    print("Language Code:", obj.get_lang_code_iso639())
    print("Language full Name:", obj.get_lang_name())

Outputs

Confidence Score:    0.827755331993103
Language Code:      en
Language full Name: English

The confidence score indicates how certain the model is about the detected language: the higher the value, the more confident the prediction.
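
One common way to use the score is to discard predictions that fall below a cut-off. The sketch below assumes the (labels, scores) pair has the shape returned by model.predict with k greater than 1. The function name filter_predictions, the sample values, and the 0.5 default threshold are assumptions made for illustration, not recommendations from the Fasttext authors.

```python
def filter_predictions(labels, scores, threshold=0.5):
    """Pair each label with its score and keep those at or above threshold."""
    prefix = "__label__"
    results = []
    for label, score in zip(labels, scores):
        # Strip the fastText label prefix to get the bare language code
        code = label[len(prefix):] if label.startswith(prefix) else label
        if score >= threshold:
            results.append((code, float(score)))
    return results


# Values shaped like the output of model.predict(text, k=3);
# the numbers are made up for illustration.
labels = ("__label__en", "__label__de", "__label__nl")
scores = [0.83, 0.09, 0.03]

print(filter_predictions(labels, scores))        # keeps only "en"
print(filter_predictions(labels, scores, 0.05))  # keeps "en" and "de"
```

A suitable threshold depends on the application, and short or mixed-language inputs tend to produce lower scores.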

We can test texts in different natural languages using the above code. The entire Python source code for Fasttext automatic language identification is available in the notepub.io GitHub repository.
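
To try several texts in one go, a small helper can loop over them. This is a sketch under the assumption that the detector exposes the same methods as the LID class in lid.py. The name detect_many is hypothetical.

```python
def detect_many(detector, texts):
    """Run detector.detect on each text and collect (text, code, name, score).

    `detector` is any object exposing the LID interface from lid.py.
    Texts that cannot be identified are reported with None values.
    """
    rows = []
    for text in texts:
        if detector.detect(text):
            rows.append((
                text,
                detector.get_lang_code_iso639(),
                detector.get_lang_name(),
                detector.get_confidence_score(),
            ))
        else:
            rows.append((text, None, None, None))
    return rows


# Example usage, assuming lid.py and the model file described above exist:
# from lid import FastTextBasedLID
# detector = FastTextBasedLID("./models/lid.176.bin")
# for row in detect_many(detector, ["Guten Morgen", "Bonjour tout le monde"]):
#     print(row)
```

Reporting None for failed detections avoids accidentally reusing the language and score left over from a previous call.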

To run the above code directly in a Jupyter notebook, one can follow the Kaggle Notebook.

References


CITE THIS AS:

“Automatic Language Identification in Texts using fasttext python library”  From NotePub.io – Publish & Share Note! https://notepub.io/notes/artificial-intelligence/speech-and-language-processing/language-identification/lid-using-fasttext-python-package/
