An earlier note in this series on automatic language identification introduced the langdetect library, which detects languages with a Naive Bayes classifier over character n-grams.
This note introduces another language identification tool, which is part of the sparknlp package. Its authors designed and developed deep learning models using CNN architectures in TensorFlow/Keras. The approach is data-driven, and training requires more data and computing power than the Naive Bayes approach.
The models are trained on large datasets such as Wikipedia and Tatoeba and reach high accuracy when evaluated on the Europarl dataset. The output is a language code in Wiki Code style. However, we provide a function that returns the output in the standard ISO 639 format.
In this use case, we use the pre-trained models and the pre-trained pipeline for language detection provided by the sparknlp package. They are trained on 375 languages, so they can identify 375 languages.
Introduction
Language detection is one of the crucial tasks in NLP, and it becomes more challenging when different languages are written in the same script. In those cases, matching each word against a dictionary of each language does not work well. Moreover, such matching requires additional knowledge of scripts (or Unicode) and languages.
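To make the shared-script problem concrete, here is a minimal sketch that uses only Python's standard unicodedata module. The helper name scripts_in is ours, and taking the first word of a character's Unicode name as its script is a rough heuristic, not a proper script lookup. It shows that a Hindi and a Marathi greeting fall into the same script, as do an English and a French one, so the script alone cannot separate them.

```python
import unicodedata


def scripts_in(text):
    """Return the set of script names guessed from each letter's Unicode name."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            # e.g. "DEVANAGARI LETTER NA" -> "DEVANAGARI"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


# Hindi and Marathi greetings share one script, as do English and French ones.
print(scripts_in("नमस्ते"), scripts_in("नमस्कार"))
print(scripts_in("hello"), scripts_in("bonjour"))
```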
Environment Setup
We will use the sparknlp Python library to test and evaluate automatic language identification on given input texts. First, we create a conda virtual environment. We assume that a Python distribution such as Miniconda, Anaconda, or a native Python installation is already present on your computer. If not, download and install one of these distributions first.
Creation of Conda Virtual Environment
You can run the shell commands below in your operating system's terminal, in the same sequence, without root privileges. We observed that sparknlp does not work with Python 3.9 but works well with Python 3.8.
    # Create the environment
    conda create -n venv_sparknlp python=3.8 -y
    # Configure the shell for conda
    conda init bash
    # Activate the environment
    conda activate venv_sparknlp
The commands above create a conda virtual environment for Python 3.8 and then activate it.
Installation of Packages
The script below installs sparknlp and pyspark from the johnsnowlabs and conda-forge channels, respectively. It also installs pycountry, which helps to get language names from ISO 639 language codes.
    conda install -c conda-forge pyspark=3.1.2 -y
    conda install -c johnsnowlabs spark-nlp=3.4.1 -y
    pip install 'pycountry==22.1.10'
Note: These versions are tightly coupled to the Python version. If you change any of them, the setup may stop working.
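Because the setup is sensitive to versions, it can help to verify the environment before running anything else. The following is a minimal sketch using only the standard library; the helper name find_mismatches is ours, and we assume the installed distributions are registered under the same names used in the install commands (pyspark, spark-nlp, pycountry).

```python
import sys
from importlib import metadata

# Pins taken from the install commands in this note; the distribution names
# are assumed to match the package names used with conda/pip.
EXPECTED = {"pyspark": "3.1.2", "spark-nlp": "3.4.1", "pycountry": "22.1.10"}


def find_mismatches(expected):
    """Map each package whose installed version differs to what was found."""
    problems = {}
    for name, wanted in expected.items():
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            found = "not installed"
        if found != wanted:
            problems[name] = found
    return problems


if sys.version_info[:2] != (3, 8):
    print("Warning: this note was tested with Python 3.8, found",
          sys.version.split()[0])
print(find_mismatches(EXPECTED) or "All pinned versions match")
```

An empty result means every pinned package is present in the expected version.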
Python code for SparkNLP language detection
There are two classes, LID and SparkNLPBasedLID. LID defines the standard methods needed for any language detection operation. SparkNLPBasedLID inherits from LID and overrides the detect function. It is designed this way because the input parameters and logic of the detect function may differ between language identification libraries.
According to the SparkNLP documentation, it works in both offline and online modes. However, offline mode did not work in our setup, even after supplying the correct path of the downloaded pre-trained model or pre-trained pipeline.
Let us assume that the code below is stored in the lid.py file.
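When offline loading fails, one thing worth ruling out early is the local path itself. The sketch below is a hypothetical helper (check_offline_path is our name, not part of sparknlp) that inspects a directory before Spark is started. It assumes that an unpacked Spark ML model or pipeline directory contains a metadata subfolder; treat that as an assumption to check against your own download.

```python
import os


def check_offline_path(path):
    """Return a list of problems found with a local model directory."""
    if not os.path.isdir(path):
        return [f"not a directory: {path}"]
    # Assumption: an unpacked Spark ML model or pipeline has a 'metadata' folder.
    if not os.path.isdir(os.path.join(path, "metadata")):
        return [f"no 'metadata' folder in {path}; is the archive unpacked?"]
    return []


# Replace the placeholder with the directory you downloaded the model to.
print(check_offline_path("<Location>/lid/models"))
```

An empty list only means the directory looks plausible; it does not guarantee that the model loads.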
    import sparknlp
    from pyspark.ml import Pipeline  # needed to build the model-based pipeline
    from sparknlp.pretrained import PretrainedPipeline
    from sparknlp.base import *
    from sparknlp.annotator import *
    from pycountry import languages


    class LID():
        """Common interface for language identification back ends."""

        def __init__(self):
            self.lang = None
            self.score = None
            self.lang_name = None

        def get_lang_code_iso639(self):
            return self.lang

        def get_confidence_score(self):
            return self.score

        def get_lang_name(self):
            return self.lang_name

        def detect(self, input) -> bool:
            # Subclasses override this and return True on success.
            pass


    class SparkNLPBasedLID(LID):
        def __init__(self, model_identity, approach="pretrained_pipeline",
                     mode="online", enable_gpu=False, expected_lang="xx"):
            super().__init__()
            self.status = False
            self.model = None
            self.model_identity = model_identity
            self.spark = None
            self.approach = approach  # "pretrained_pipeline" or "pretrained_model"
            try:
                self.spark = sparknlp.start(gpu=enable_gpu)
                print("\nVersion", sparknlp.version())
                if self.approach == "pretrained_pipeline":
                    if mode == "offline":
                        self.model = PretrainedPipeline(self.model_identity, expected_lang,
                                                        disk_location=self.model_identity)
                    else:
                        self.model = PretrainedPipeline(self.model_identity, expected_lang)
                elif self.approach == "pretrained_model":
                    if mode == "offline":
                        # load() reads a model directory from local disk;
                        # pretrained(..., remote_loc=...) expects a remote repository folder.
                        self.model = LanguageDetectorDL.load(self.model_identity)
                    else:
                        self.model = LanguageDetectorDL.pretrained(self.model_identity)
                # Ready only if one of the branches above produced a model.
                self.status = self.model is not None
            except Exception as ex:
                self.__report(ex)
                self.status = False

        @staticmethod
        def __report(ex):
            template = "An exception of type {0} occurred. Arguments:\n{1!r}"
            print(template.format(type(ex).__name__, ex.args))

        def detect(self, input) -> bool:
            if not self.status:
                return False
            if self.approach == "pretrained_pipeline":
                return self.__detect_pipeline(input)
            elif self.approach == "pretrained_model":
                return self.__detect_model(input)
            return False

        def __set_result(self, lang_list) -> bool:
            # Accept the result only when exactly one label is returned.
            print(lang_list)
            if len(lang_list) != 1:
                return False
            self.lang = lang_list[0]
            self.score = 0.99  # fixed placeholder; the model's own score is not read here
            match = languages.get(alpha_2=self.lang)  # matches two-letter codes only
            self.lang_name = match.name if match else None
            return True

        def __detect_pipeline(self, input) -> bool:
            self.input = input
            try:
                result = self.model.annotate(self.input)
                if "language" in result:
                    return self.__set_result(result["language"])
            except Exception as ex:
                self.__report(ex)
            return False

        def __detect_model(self, input) -> bool:
            documentAssembler = DocumentAssembler()\
                .setInputCol("text")\
                .setOutputCol("document")
            language_detector = self.model\
                .setInputCols(["document"])\
                .setOutputCol("lang")\
                .setThreshold(0.8)\
                .setCoalesceSentences(True)
            languagePipeline = Pipeline(stages=[documentAssembler, language_detector])
            # Fit on an empty frame: the stages are already trained.
            empty_df = self.spark.createDataFrame([['']]).toDF("text")
            light_pipeline = LightPipeline(languagePipeline.fit(empty_df))
            result = light_pipeline.annotate(input)
            if "lang" in result:
                return self.__set_result(result["lang"])
            return False
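The class looks up the language name with a two-letter ISO 639 code, which can miss labels that arrive in Wiki Code style. The following is a minimal sketch of a normalisation step, assuming the labels follow Wikipedia edition codes. The function name wiki_to_iso639 is ours, and the mapping table is an illustrative subset of well-known edition codes that differ from ISO 639, not a complete list and not taken from the sparknlp models.

```python
import re
from typing import Optional

# Wikipedia edition codes that differ from ISO 639 (illustrative subset).
WIKI_TO_ISO639 = {
    "als": "gsw",           # Alemannic
    "bat-smg": "sgs",       # Samogitian
    "fiu-vro": "vro",       # Võro
    "roa-rup": "rup",       # Aromanian
    "zh-classical": "lzh",  # Classical Chinese
    "zh-min-nan": "nan",    # Min Nan
    "zh-yue": "yue",        # Cantonese
}


def wiki_to_iso639(code: str) -> Optional[str]:
    """Return an ISO 639 style code for a Wiki Code label, or None if unclear."""
    code = code.strip().lower()
    if code in WIKI_TO_ISO639:
        return WIKI_TO_ISO639[code]
    # Plain two- or three-letter labels are passed through unchanged.
    if re.fullmatch(r"[a-z]{2,3}", code):
        return code
    return None


print(wiki_to_iso639("zh-yue"), wiki_to_iso639("bo"))
```

A three-letter result can then be resolved with pycountry using languages.get(alpha_3=...) instead of alpha_2.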
Language Detection
Let us assume that the code below is stored in the main.py file, next to the lid.py file that contains the classes defined above.
SparkNLPBasedLID is designed to work with pre-trained models and pre-trained pipelines, both online and offline. We demonstrate each case with an example.
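The four cases that follow differ only in the approach, the mode, and the model identity. As a hedged sketch, they can also be driven from one table. The names IDENTITIES and run_all are ours; the offline entries keep this note's <Location> placeholder, which must be replaced with your own download directory. The detector class is passed in as a parameter, so the loop can be tried with a stub before Spark is involved.

```python
from itertools import product

# Identifiers as used in this note; the offline entries keep the note's
# <Location> placeholder and must point at your own download directory.
IDENTITIES = {
    ("pretrained_model", "online"): "ld_wiki_tatoeba_cnn_375",
    ("pretrained_pipeline", "online"): "detect_language_375",
    ("pretrained_model", "offline"):
        "<Location>/lid/models/ld_wiki_tatoeba_cnn_375_xx_2.7.0_2.4_1607184873730",
    ("pretrained_pipeline", "offline"):
        "<Location>/lid/models/detect_language_375_xx_2.7.0_2.4_1607185980306",
}


def run_all(detector_factory, text):
    """Run every approach/mode pair and collect the detected code (or None)."""
    results = {}
    for approach, mode in product(("pretrained_model", "pretrained_pipeline"),
                                  ("online", "offline")):
        detector = detector_factory(model_identity=IDENTITIES[(approach, mode)],
                                    approach=approach, mode=mode)
        ok = detector.detect(text)
        results[(approach, mode)] = detector.get_lang_code_iso639() if ok else None
    return results
```

In main.py this would be called as run_all(SparkNLPBasedLID, script) after importing SparkNLPBasedLID from lid.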
Language Detection using pre-trained model in online mode
    from lid import SparkNLPBasedLID

    # `script` holds the input text to classify and must be defined beforehand.
    # Pretrained Model
    pretrained_model_online = "ld_wiki_tatoeba_cnn_375"
    obj = SparkNLPBasedLID(model_identity=pretrained_model_online,
                           mode="online", approach="pretrained_model")
    if obj.detect(script):
        print("Confidence Score:", obj.get_confidence_score())
        print("Language Code:", obj.get_lang_code_iso639())
        print("Language full Name:", obj.get_lang_name())
Language Detection using pre-trained pipeline in online mode
    # Pretrained Pipeline
    pretrained_pipeline_online = 'detect_language_375'
    obj = SparkNLPBasedLID(model_identity=pretrained_pipeline_online,
                           mode="online", approach="pretrained_pipeline")
    if obj.detect(script):
        print("Confidence Score:", obj.get_confidence_score())
        print("Language Code:", obj.get_lang_code_iso639())
        print("Language full Name:", obj.get_lang_name())
Language Detection using pre-trained model in offline mode
    # Pretrained Model Offline
    pretrained_model_offline = '<Location>/lid/models/ld_wiki_tatoeba_cnn_375_xx_2.7.0_2.4_1607184873730'
    obj = SparkNLPBasedLID(model_identity=pretrained_model_offline,
                           mode="offline", approach="pretrained_model")
    if obj.detect(script):
        print("Confidence Score:", obj.get_confidence_score())
        print("Language Code:", obj.get_lang_code_iso639())
        print("Language full Name:", obj.get_lang_name())
Language Detection using pre-trained pipeline in offline mode
    # Pretrained Pipeline Offline
    pretrained_pipeline_offline = '<Location>/lid/models/detect_language_375_xx_2.7.0_2.4_1607185980306'
    obj = SparkNLPBasedLID(model_identity=pretrained_pipeline_offline,
                           mode="offline", approach="pretrained_pipeline")
    if obj.detect(script):
        print("Confidence Score:", obj.get_confidence_score())
        print("Language Code:", obj.get_lang_code_iso639())
        print("Language full Name:", obj.get_lang_name())
Outputs
    Confidence Score: 0.99
    Language Code: bo
    Language full Name: Tibetan
We can test texts in different natural languages or scripts using the code above. The complete Python source, which uses the sparknlp library for automatic language identification, is available in the GitHub repository listed in the references. We also tried to create a Kaggle notebook, but the current version of sparknlp is not compatible with Python 3.7, which Kaggle notebooks currently use.
References
- https://github.com/notepub-io/language-identification-in-texts/tree/main/sparknlp
- https://nlp.johnsnowlabs.com/2020/12/05/detect_language_375_xx.html
CITE THIS AS:
“Automatic Language Identification in Texts – SparkNLP” From NotePub.io – Publish & Share Note! https://notepub.io/notes/artificial-intelligence/speech-and-language-processing/language-identification/lid-sparknlp/