Language detection is vital in Natural Language Processing (NLP) because many NLP tasks are language-dependent. However, it is difficult to find a language detector that supports most natural languages and also handles short texts and multilingual texts well. According to our studies, the fastText library performs well compared to other automatic language identification libraries such as gcld3, langdetect, langid, nltk_textcat, polyglot, spacy_langdetect, and sparknlp_detect.
Introduction
fastText is a library created by Facebook's AI research lab for efficient learning of text representations and text classification. For the underlying technical concepts, refer to the papers listed in the fastText documentation.
It is one of the most optimized libraries for text representation and classification because it was built with production constraints in mind, such as time complexity and storage capacity. The authors also provide a pre-trained language identification model that supports 176 natural languages. The model was trained on data drawn from Wikipedia, Tatoeba, and SETimes.
Environment Setup
We will use the fastText Python library to test and evaluate automatic language identification on given input texts. First, we will create a conda virtual environment. This assumes that a Python distribution (Miniconda, Anaconda, or native Python) is already installed on your computer; note that the conda commands used here require Miniconda or Anaconda. If none is installed, download and install one of these distributions first.
Creation of Conda Virtual Environment
You can run the shell commands below in your operating system's terminal, in the given order. Root privileges are not required.
# Create the environment
conda create -n venv_fasttext python=3.9.5
# Configure the shell for conda
conda init bash
# Activate the environment
conda activate venv_fasttext
The commands above create a conda virtual environment with Python 3.9.5, which was one of the latest stable Python versions at the time of writing.
Installation of Packages
The script below installs fasttext version 0.9.2 from the conda-forge channel. It also installs pycountry, which is used to look up language names from ISO 639 language codes.
conda install -c conda-forge -y fasttext=0.9.2
pip install 'pycountry==22.1.10'
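Before moving on, it can help to confirm that both packages are importable from the activated environment. The following is a minimal sketch that uses only the standard library; the helper name `missing_packages` is hypothetical and not part of fasttext or pycountry.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # "fasttext" and "pycountry" are the import names of the packages installed above.
    missing = missing_packages(["fasttext", "pycountry"])
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All packages are available.")
```

If a package is reported as missing, the most likely cause is that the commands were run outside the activated venv_fasttext environment.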
Python Code for fastText
There are two classes, LID and FastTextBasedLID. LID defines the standard methods that any natural language detection operation needs. FastTextBasedLID inherits from LID and overrides the detect function. The code is designed this way because the input parameters and logic of the detect function may differ from one language identification library to another.
Let us assume that the code below is stored in a file named lid.py.
import fasttext
from pycountry import languages


class LID:
    """Common interface for language identification (LID) back ends."""

    def __init__(self):
        self.lang = None
        self.score = None
        self.lang_name = None

    def get_lang_code_iso639(self):
        return self.lang

    def get_confidence_score(self):
        return self.score

    def get_lang_name(self):
        return self.lang_name

    def detect(self, text) -> bool:
        # Subclasses override this method and return True on success.
        return False


class FastTextBasedLID(LID):
    def __init__(self, pretrained_model_path):
        super().__init__()
        # Silence the warning that fasttext prints when loading a model.
        fasttext.FastText.eprint = lambda x: None
        self.pretrained_model = pretrained_model_path
        try:
            self.model = fasttext.load_model(self.pretrained_model)
        except Exception as ex:
            template = "An exception of type {0} occurred. Arguments:\n{1!r}"
            print(template.format(type(ex).__name__, ex.args))
            self.model = None

    def detect(self, text) -> bool:
        # predict() processes one line at a time, so newlines are replaced.
        self.input = text.replace("\n", " ")
        try:
            if self.model:
                predictions = self.model.predict(self.input, k=1)
                if predictions:
                    self.lang = predictions[0][0].split("__label__")[1]
                    self.score = predictions[1][0]
                    # Some model labels are three-letter codes, so try both lookups.
                    language = (languages.get(alpha_2=self.lang)
                                or languages.get(alpha_3=self.lang))
                    self.lang_name = language.name if language else None
                    return True
        except Exception as ex:
            template = "An exception of type {0} occurred. Arguments:\n{1!r}"
            print(template.format(type(ex).__name__, ex.args))
            return False
        return False
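To illustrate why detect is the only method a subclass has to override, here is a minimal sketch of a second detector that plugs into the same interface. Both the `StopwordBasedLID` class and its tiny stopword table are hypothetical and exist only for illustration; the `LID` class is a reduced stand-in for the base class described above so that the sketch runs without fasttext installed.

```python
class LID:
    """Reduced stand-in for the LID base class (same method names)."""

    def __init__(self):
        self.lang = None
        self.score = None
        self.lang_name = None

    def get_lang_code_iso639(self):
        return self.lang

    def get_confidence_score(self):
        return self.score

    def get_lang_name(self):
        return self.lang_name

    def detect(self, text) -> bool:
        raise NotImplementedError


class StopwordBasedLID(LID):
    """Hypothetical toy detector: picks the language with the most stopword hits."""

    STOPWORDS = {
        "en": ("English", {"the", "is", "and", "of", "to"}),
        "fr": ("French", {"le", "la", "et", "est", "les"}),
    }

    def detect(self, text) -> bool:
        words = text.lower().split()
        # Count how many words of the text appear in each language's stopword set.
        hits = {code: sum(w in sw for w in words)
                for code, (_, sw) in self.STOPWORDS.items()}
        best = max(hits, key=hits.get)
        if not words or hits[best] == 0:
            return False
        self.lang = best
        self.score = hits[best] / len(words)  # share of words that matched
        self.lang_name = self.STOPWORDS[best][0]
        return True


detector = StopwordBasedLID()
if detector.detect("the cat is on the mat"):
    print(detector.get_lang_code_iso639(), detector.get_lang_name())
```

Calling code only uses detect and the three getters, so a fastText-based detector and a toy detector like this one can be swapped without changing the caller.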
Language Detection
Let us assume that the code below is stored in a file named main.py, next to the lid.py file that contains the classes defined above.
We also assume that there is a directory called models that contains the pre-trained language identification model (lid.176.bin) provided by the fastText project.
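If the models directory does not exist yet, the model file can be fetched once and reused afterwards. The following is a minimal sketch using only the standard library. The helper name `ensure_model` is hypothetical, and the download URL is an assumption based on the fastText language identification page listed in the references, so verify it there before relying on it.

```python
import os
import urllib.request

# Assumed download location; check the fastText language identification page.
MODEL_URL = "https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin"


def ensure_model(path="./models/lid.176.bin", url=MODEL_URL):
    """Download the model to `path` unless a file already exists there."""
    if os.path.isfile(path):
        return path  # reuse the existing file, no network access
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    urllib.request.urlretrieve(url, path)
    return path
```

Calling `ensure_model()` once before constructing the detector returns the local path, which can then be passed to FastTextBasedLID.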
from lid import FastTextBasedLID

model = "./models/lid.176.bin"
script = "This note series is designed to provide the insides of using open-source language identifiers to predict the natural language of a given input text."

obj = FastTextBasedLID(model)
if obj.detect(script):
    print("Confidence Score:", obj.get_confidence_score())
    print("Language Code:", obj.get_lang_code_iso639())
    print("Language full Name:", obj.get_lang_name())
Outputs
Confidence Score: 0.827755331993103
Language Code: en
Language full Name: English
The confidence score indicates how certain the model is about the detected language. A higher value means a more confident prediction.
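One common way to use the confidence score is to reject predictions that fall below a cut-off. The sketch below assumes the prediction has the same shape that the detect function indexes into, namely a pair of (labels, probabilities). The helper name `top_prediction` and the default threshold of 0.5 are hypothetical choices for illustration, not values recommended by fastText.

```python
def top_prediction(predictions, threshold=0.5):
    """Return (lang_code, score), or (None, score) if the score is below threshold.

    `predictions` is a (labels, probabilities) pair, as produced by
    model.predict(text, k=1) in the detect function.
    """
    labels, scores = predictions
    code = labels[0].replace("__label__", "", 1)  # strip the label prefix
    score = float(scores[0])
    if score < threshold:
        return None, score
    return code, score


# The tuples mimic the shape of a prediction; the numbers are made up.
print(top_prediction((("__label__en",), [0.83])))  # ('en', 0.83)
print(top_prediction((("__label__en",), [0.31])))  # (None, 0.31)
```

A suitable threshold depends on the data: short or mixed-language inputs tend to produce lower scores, so it is worth tuning the value on representative samples.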
The code above can be used to test texts in different natural languages. The complete Python source code for automatic language identification with fastText is available in the notepub.io GitHub repository.
To run the code directly in a Jupyter notebook, follow the Kaggle notebook listed in the references.
References
- https://fasttext.cc/docs/en/language-identification.html
- https://github.com/notepub-io/language-identification-in-texts/tree/main/fasttext
- https://www.kaggle.com/notepub/notebook3ed346aef5
CITE THIS AS:
“Automatic Language Identification in Texts using fasttext python library” From NotePub.io – Publish & Share Note! https://notepub.io/notes/artificial-intelligence/speech-and-language-processing/language-identification/lid-using-fasttext-python-package/