Automatic Language Identification in Texts – SparkNLP

In the earlier note of this series on automatic language identification, we introduced the langdetect library, which uses a Naive Bayes classifier over character n-grams to detect the language of a text.
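
As a reminder of the idea behind that earlier approach, the following is a toy sketch of a Naive Bayes classifier over character n-grams. It is not langdetect's implementation: the class name NgramNaiveBayes, the two training sentences, and the choice of trigrams with Laplace smoothing are assumptions made only for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams, padding with a space at both ends."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


class NgramNaiveBayes:
    """Toy multinomial Naive Bayes over character n-grams (illustrative only)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = {}    # language -> Counter of n-grams seen in training
        self.vocab = set()  # all n-grams seen in training, across languages

    def fit(self, samples):
        # samples: iterable of (text, language) pairs
        for text, lang in samples:
            grams = char_ngrams(text, self.n)
            self.counts.setdefault(lang, Counter()).update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, text):
        grams = char_ngrams(text, self.n)
        best_lang, best_score = None, -math.inf
        for lang, counts in self.counts.items():
            total = sum(counts.values())
            # Sum of log-probabilities with Laplace smoothing; uniform prior over languages
            score = sum(
                freq * math.log((counts[g] + 1) / (total + len(self.vocab)))
                for g, freq in grams.items()
            )
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang


# Two made-up training sentences, far too little data for real use
model = NgramNaiveBayes().fit([
    ("the quick brown fox jumps over the lazy dog", "en"),
    ("der schnelle braune fuchs springt über den faulen hund", "de"),
])
print(model.predict("the lazy dog"))
```

A real detector of this kind relies on n-gram profiles built from large corpora for every language; the sketch only shows how n-gram counts turn into a per-language score.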

In this note, we introduce another language identification library, which is part of the sparknlp package. Its developers designed and trained deep learning models with CNN architectures in TensorFlow/Keras. This approach is data-driven, and training requires more data and computing power than the Naive Bayes approach.

The models are trained on large datasets such as Wikipedia and Tatoeba and reach high accuracy when evaluated on the Europarl dataset. The output is a language code in Wikipedia code style. However, we provide a function that formats the output in the standard ISO 639 format.
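
To make the difference between Wikipedia-style codes and ISO 639 codes concrete, here is a minimal sketch of such a normalization step. The helper name normalize_wiki_code and the mapping table are assumptions written for this example; the table is a small hand-picked subset, it does not come from sparknlp, and whether the model emits these particular codes is not confirmed here.

```python
# Illustrative subset of Wikipedia language codes that differ from ISO 639.
# Hand-written for this sketch; not exhaustive and not taken from sparknlp.
WIKI_TO_ISO639 = {
    "simple": "en",         # Simple English Wikipedia
    "als": "gsw",           # Alemannic
    "bat-smg": "sgs",       # Samogitian
    "fiu-vro": "vro",       # Võro
    "zh-yue": "yue",        # Cantonese
    "zh-min-nan": "nan",    # Min Nan
    "zh-classical": "lzh",  # Classical Chinese
}


def normalize_wiki_code(code):
    """Return an ISO 639 code for a Wikipedia-style code (hypothetical helper)."""
    code = code.strip().lower()
    # Codes that are not in the table are assumed to be valid ISO 639 codes already
    return WIKI_TO_ISO639.get(code, code)


print(normalize_wiki_code("zh-yue"))
print(normalize_wiki_code("bo"))
```

Codes missing from the table pass through unchanged, so the helper is only as good as the table it is given.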

In this use case, we use the pre-trained models and the pre-trained pipeline for language detection provided by the sparknlp package. Both are trained on 375 languages, so they can identify 375 languages.

Introduction

Language detection is a crucial task in NLP, and it becomes more challenging when different languages are written in the same script. In those cases, matching each word against a dictionary of each language does not work. Moreover, such matching requires additional knowledge of scripts (or Unicode) and languages.

Environment Setup

We will use the sparknlp Python library to test and evaluate automatic language identification on given input texts. First, we will create a conda virtual environment. We assume that a Python distribution is already installed on your computer, for example Miniconda, Anaconda, or native Python. If not, download and install one of these distributions first.

Creation of Conda Virtual Environment

You can run the shell commands below in your operating system's terminal, in the same sequence, without root privileges. We observed that sparknlp does not work with Python 3.9 but works well with Python 3.8.

# Create Environment
conda create -n venv_sparknlp python=3.8 -y

# Configure shell for conda
conda init bash

# Activate Environment
conda activate venv_sparknlp

The code above creates a conda virtual environment with Python 3.8 and then activates it.

Installation of Packages

The script below installs sparknlp and pyspark from the johnsnowlabs and conda-forge channels, respectively. It also installs pycountry, which helps to get language names from ISO 639 language codes.

conda install -c conda-forge pyspark=3.1.2 -y
conda install -c johnsnowlabs spark-nlp=3.4.1 -y 
pip install 'pycountry==22.1.10'

Note: These versions are tightly coupled to the Python version. If you change any of them, the setup may stop working.
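
Since the setup depends on exact versions, a quick check of the active environment can save debugging time. The following is a minimal sketch using only the standard library; the function names installed_version and find_mismatches are made up for this example, and the pins are copied from the install commands above.

```python
import sys
from importlib import metadata  # part of the standard library from Python 3.8

# Pins copied from the install commands above
PINNED = {"pyspark": "3.1.2", "spark-nlp": "3.4.1", "pycountry": "22.1.10"}


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Map each package whose installed version differs from its pin to (found, wanted)."""
    return {
        name: (lookup(name), wanted)
        for name, wanted in pinned.items()
        if lookup(name) != wanted
    }


if __name__ == "__main__":
    print("Python", ".".join(map(str, sys.version_info[:3])))
    for name, (found, wanted) in find_mismatches(PINNED).items():
        print(f"{name}: found {found}, expected {wanted}")
```

Run inside the activated environment, the script prints nothing after the Python version when all three pins match.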

Python code for SparkNLP language detection

There are two classes, LID and SparkNLPBasedLID. LID defines the standard methods needed for any language detection operation. SparkNLPBasedLID inherits from LID and overrides the detect function. It is designed this way because the input parameters and logic of the detect function may differ between language identification libraries.
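
The same base-class-plus-override pattern can be sketched in isolation. The names BaseDetector and ConstantDetector below are hypothetical stand-ins, and using abc to force subclasses to implement detect is a design choice of this sketch, not something the LID class in this note does.

```python
from abc import ABC, abstractmethod


class BaseDetector(ABC):
    """Hypothetical stand-in for LID: shared result fields plus one abstract hook."""

    def __init__(self):
        self.lang = None
        self.score = None

    def get_lang_code(self):
        return self.lang

    @abstractmethod
    def detect(self, text) -> bool:
        """Fill self.lang and self.score, then return True on success."""


class ConstantDetector(BaseDetector):
    """Dummy backend that always answers with one language; only shows the override."""

    def __init__(self, answer):
        super().__init__()
        self._answer = answer

    def detect(self, text) -> bool:
        if not text:
            return False
        self.lang = self._answer
        self.score = None  # this dummy backend has no confidence estimate
        return True


detector = ConstantDetector("bo")
if detector.detect("some input text"):
    print(detector.get_lang_code())
```

Each library-specific backend then only has to supply its own detect, while callers keep using the shared getters.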

According to the SparkNLP documentation, it works in both offline and online modes. However, offline mode did not work in our setup, even after supplying the correct path of the downloaded pre-trained model or pre-trained pipeline.
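
One thing worth ruling out before debugging offline loading is the folder itself. Models and pipelines saved with Spark ML are directories that contain a metadata subfolder, so a path pointing at a zip archive or at a parent folder will not load. The helper below is a hypothetical pre-flight check based on that layout; it does not explain the failure reported above.

```python
import os
import tempfile


def looks_like_saved_spark_model(path):
    """Heuristic: a Spark ML model saved to disk is a directory with a metadata/ subfolder."""
    return os.path.isdir(path) and os.path.isdir(os.path.join(path, "metadata"))


# Demonstration on a throwaway directory that mimics the layout
with tempfile.TemporaryDirectory() as tmp:
    model_dir = os.path.join(tmp, "some_model")
    os.makedirs(os.path.join(model_dir, "metadata"))
    print(looks_like_saved_spark_model(model_dir))  # True
    print(looks_like_saved_spark_model(tmp))        # False
```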

Let us assume that the code below is stored in a file named lid.py.

import sparknlp
from pyspark.ml import Pipeline  # imported explicitly; used in __detect_model
from sparknlp.pretrained import PretrainedPipeline
from sparknlp.base import *
from sparknlp.annotator import *

from pycountry import languages

class LID():

    def __init__(self):
        self.lang = None 
        self.score = None 
        self.lang_name = None 

    def get_lang_code_iso639(self):
        return(self.lang)
    
    def get_confidence_score(self):
        return(self.score)

    def get_lang_name(self):
        return(self.lang_name)

    def detect(self,input) -> bool:
        pass

class SparkNLPBasedLID(LID):

    def __init__(self,model_identity,approach="pretrained_pipeline", mode="online",enable_gpu=False, expected_lang="xx"):
        
        super().__init__()

        self.status = False

        self.model = None

        self.model_identity = model_identity

        self.spark = None

        self.approach = approach
        # pretrained_pipeline, pretrained_models

        try:

            self.status = True    
            self.spark = sparknlp.start()
            print("\nVersion",sparknlp.version())

            if self.approach == "pretrained_pipeline":
                
                if mode == "offline":
                    self.model = PretrainedPipeline(self.model_identity,expected_lang,disk_location=self.model_identity)
                else:
                    self.model = PretrainedPipeline(self.model_identity,expected_lang)

            elif self.approach == "pretrained_model":
                
                if mode == "offline":
                    # load() reads a model saved on disk; pretrained() would try to resolve the name remotely
                    self.model = LanguageDetectorDL.load(self.model_identity)
                else:
                    self.model = LanguageDetectorDL.pretrained(self.model_identity)

            else:
                self.status = False 

        except Exception as ex:
            template = "An exception of type {0} occurred. Arguments:\n{1!r}"
            message = template.format(type(ex).__name__, ex.args)
            print(message)
            self.status = False

    def detect(self,input) -> bool:
        
        if self.status == False:
            return(self.status)

        if self.approach == "pretrained_pipeline":
            return(self.__detect_pipeline(input))
        elif self.approach == "pretrained_model":
            return(self.__detect_model(input))
        else:
            return(False)

    def __detect_pipeline(self,input) -> bool:

        self.input = input
        try:
            if self.status:
            
                result = self.model.annotate(self.input)
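                if "language" in result:
                    lang_list = result["language"]
                    print(lang_list)
                    # Without exactly one detected language there is nothing to report
                    if len(lang_list) != 1:
                        return(False)
                    self.lang = lang_list[0]
                    # Placeholder value: annotate() does not return a confidence score
                    self.score =0.99
                    # pycountry returns None for codes that are not two-letter ISO 639-1 codes
                    match = languages.get(alpha_2=self.lang)
                    self.lang_name = match.name if match else None
                    return(True)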

        except Exception as ex:
            template = "An exception of type {0} occurred. Arguments:\n{1!r}"
            message = template.format(type(ex).__name__, ex.args)
            print(message)
            return(False)
        
        return(False)

    def __detect_model(self,input) -> bool:

        if self.status:
               
            documentAssembler = DocumentAssembler()\
            .setInputCol("text")\
            .setOutputCol("document")
            language_detector = self.model\
            .setInputCols(["document"])\
            .setOutputCol("lang")\
            .setThreshold(0.8)\
            .setCoalesceSentences(True)

            languagePipeline = Pipeline(stages=[documentAssembler, language_detector])
            light_pipeline = LightPipeline(languagePipeline.fit(self.spark.createDataFrame([['']]).toDF("text")))
            #result = light_pipeline.fullAnnotate(input)
            result = light_pipeline.annotate(input)
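            if "lang" in result:
                lang_list = result["lang"]
                print(lang_list)
                # Without exactly one detected language there is nothing to report
                if len(lang_list) != 1:
                    return(False)
                self.lang = lang_list[0]
                # Placeholder value: annotate() does not return a confidence score
                self.score =0.99
                # pycountry returns None for codes that are not two-letter ISO 639-1 codes
                match = languages.get(alpha_2=self.lang)
                self.lang_name = match.name if match else None
                return(True)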

        return(False)
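
The classes above store a fixed 0.99 as the score. The commented-out fullAnnotate call returns annotation objects with a metadata field, and a real score could be read from it if that metadata holds one numeric score per language code. That layout is an assumption that should be checked against your Spark NLP version. The helper best_language and the example metadata below are made up for illustration.

```python
def best_language(metadata, skip_keys=("sentence",)):
    """Return (code, score) for the highest numeric value in an annotation's metadata."""
    scores = {}
    for key, value in metadata.items():
        if key in skip_keys:
            continue  # bookkeeping entries, not language scores (assumed)
        try:
            scores[key] = float(value)
        except (TypeError, ValueError):
            continue  # ignore non-numeric entries
    if not scores:
        return None, None
    code = max(scores, key=scores.get)
    return code, scores[code]


# Made-up metadata for illustration; real values would come from fullAnnotate(...)
example = {"sentence": "0", "bo": "0.97", "dz": "0.02", "en": "0.01"}
print(best_language(example))
```

With a LightPipeline, the metadata of the first annotation should be reachable as light_pipeline.fullAnnotate(text)[0]["lang"][0].metadata, where "lang" is the output column set in __detect_model.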

Language Detection

Let us assume that the code below is stored in the main.py file, and that lid.py, which contains the classes defined above, is in the same directory.

SparkNLPBasedLID is designed to work with pre-trained models and pre-trained pipelines, both online and offline. We demonstrate all four cases with examples.

Language Detection using pre-trained model in online mode

# Pretrained Model
from lid import SparkNLPBasedLID

# Input text whose language is to be identified; replace with your own text
script = "<your input text>"

pretrained_model_online="ld_wiki_tatoeba_cnn_375"
obj = SparkNLPBasedLID(model_identity=pretrained_model_online,mode="online",approach="pretrained_model")
if obj.detect(script):
    print(obj.get_confidence_score())
    print(obj.get_lang_code_iso639())
    print(obj.get_lang_name())

Language Detection using pre-trained pipeline in online mode

# Pretrained Pipeline
pretrained_pipeline_online='detect_language_375'
obj = SparkNLPBasedLID(model_identity=pretrained_pipeline_online,mode="online",approach="pretrained_pipeline")
if obj.detect(script) == True:
    print(obj.get_confidence_score())
    print(obj.get_lang_code_iso639())
    print(obj.get_lang_name())

Language Detection using pre-trained model in offline mode

# Pretrained Model Offline
pretrained_model_offline='<Location>/lid/models/ld_wiki_tatoeba_cnn_375_xx_2.7.0_2.4_1607184873730'
obj = SparkNLPBasedLID(model_identity=pretrained_model_offline,mode="offline",approach="pretrained_model")
if obj.detect(script) == True:
    print(obj.get_confidence_score())
    print(obj.get_lang_code_iso639())
    print(obj.get_lang_name())

Language Detection using pre-trained pipeline in offline mode

# Pretrained Pipeline Offline
pretrained_pipeline_offline='<Location>/lid/models/detect_language_375_xx_2.7.0_2.4_1607185980306'
obj = SparkNLPBasedLID(model_identity=pretrained_pipeline_offline,mode="offline",approach="pretrained_pipeline")
if obj.detect(script) == True:
    print(obj.get_confidence_score())
    print(obj.get_lang_code_iso639())
    print(obj.get_lang_name())

Outputs

Confidence Score:    0.99
Language Code:       bo
Language full Name:  Tibetan
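
The snippets above print bare values, while the output shown here carries labels. A small helper along the following lines would produce the labeled layout; format_result is a hypothetical name, not part of lid.py.

```python
def format_result(score, code, name):
    """Render a detection result in the three-line labeled layout shown above."""
    rows = [
        ("Confidence Score:", score),
        ("Language Code:", code),
        ("Language full Name:", name),
    ]
    # Left-align every label in a 21-character column so the values line up
    return "\n".join(f"{label:<21}{value}" for label, value in rows)


print(format_result(0.99, "bo", "Tibetan"))
```

In main.py it could be called as print(format_result(obj.get_confidence_score(), obj.get_lang_code_iso639(), obj.get_lang_name())).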

We can test texts in different natural languages or scripts using the above code. The whole Python source, which uses the sparknlp library for automatic language identification, is uploaded to a GitHub repository. We also tried to provide a Kaggle notebook, but the current version of sparknlp is not compatible with Python 3.7, which Kaggle notebooks currently use.

CITE THIS AS:

“Automatic Language Identification in Texts – SparkNLP”  From NotePub.io – Publish & Share Note! https://notepub.io/notes/artificial-intelligence/speech-and-language-processing/language-identification/lid-sparknlp/
