Automatic Language Identification in Texts – Introduction

This note series offers insights into using open-source language identifiers to predict the natural language of a given input text. We will cover the basics of natural language identification, various openly available packages, environment setup, and how to use the tools.

Introduction

In Natural Language Processing (NLP), language identification determines the natural language in which a script, text, or document is written. Recognizing the language of a text comes naturally to a human reader familiar with that language. When automating NLP tasks, language identification is a preliminary step, as most NLP tasks are language- or script-dependent.

For example, stop words, which are removed during data cleaning or preprocessing, differ from language to language. Thus, it is helpful to know the language before applying any NLP task.
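To make the stop-word point concrete, here is a minimal sketch. The stop-word lists are tiny, hand-written, and purely illustrative (they are an assumption for this example, not taken from any library), and the function name remove_stop_words is hypothetical.

```python
# Tiny illustrative stop-word lists (hypothetical and far from complete).
STOP_WORDS = {
    "en": {"the", "is", "a", "of", "and", "in"},
    "fr": {"le", "la", "est", "un", "de", "et"},
}


def remove_stop_words(text, lang):
    """Drop tokens that appear in the stop-word list of the given language."""
    stop_words = STOP_WORDS[lang]
    return [tok for tok in text.lower().split() if tok not in stop_words]


sentence = "le chat est sur la table"

# With the correct language, the French stop words are removed.
print(remove_stop_words(sentence, "fr"))  # ['chat', 'sur', 'table']

# With the wrong language, nothing is removed and the cleaning step is useless.
print(remove_stop_words(sentence, "en"))
```

Applying the English list to a French sentence leaves every token in place, which is why the language has to be known before this kind of preprocessing.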

A few other NLP tasks that need language identification are as follows:

  • Sentiment analysis
  • Translation
  • Transliteration

Language Identification Techniques

Language identification techniques have evolved in the following order:

  • Word-based modelling method
  • N-gram-based modelling method
  • RNN and Deep learning method

When a language is low-resource, hybrid techniques are most helpful; these may be a mixture of rule-based and N-gram-based modelling methods. However, when large text sources are available for a language, deep learning models are generally the better choice.
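As an illustration of the N-gram-based idea, the following is a minimal sketch of a character trigram identifier. It is a toy under stated assumptions, not the method of any particular package: the training sentences, the overlap score, and the names char_ngrams, build_profile, and identify are all chosen for this example only.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a text, padded with spaces."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_profile(sample, n=3):
    """Count the n-grams of a training sample for one language."""
    return Counter(char_ngrams(sample, n))


def identify(text, profiles, n=3):
    """Pick the language whose profile shares the most n-grams with the text."""
    grams = Counter(char_ngrams(text, n))

    def score(profile):
        # Counter returns 0 for n-grams the profile has never seen.
        return sum(min(count, profile[gram]) for gram, count in grams.items())

    return max(profiles, key=lambda lang: score(profiles[lang]))


# Toy training data (an assumption; real systems use much larger corpora).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze schläft im haus",
}
PROFILES = {lang: build_profile(sample) for lang, sample in SAMPLES.items()}

print(identify("the dog sleeps", PROFILES))    # en
print(identify("der hund schläft", PROFILES))  # de
```

With so little training text the sketch only works on inputs that share trigrams with the samples; it is meant to show the mechanism, not to be accurate.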

Difficulties in the Language Identification Process

Distinguishing similar languages

This is a big problem, as the same script is often used to write different languages. For example:

  • Dari, Kashmiri, etc., are different languages, but the written script is Perso-Arabic.
  • English, French, German, and Italian are different languages, but the written script is Roman.
  • Hindi, Marathi, Nepali, etc., are different languages, but the written script is Devanagari.

Conversely, a single language may be written in different scripts:

  • Konkani is a language spoken in the southern part of India, written in approximately six different scripts, including Devanagari and Roman.
  • Punjabi is a language spoken in both India and Pakistan. However, it is written in the Shahmukhi script in Pakistan and the Gurmukhi script in India.
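The gap between script and language can be seen with a small script detector. This is a rough heuristic sketch: Python's standard library does not expose the Unicode script property, so the code assumes that the first word of each character's Unicode name (for example DEVANAGARI or LATIN) approximates its script. The function name dominant_script is hypothetical. Note that Unicode labels Perso-Arabic characters as ARABIC.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Guess the most frequent script among the letters of a text.

    Heuristic: use the first word of each letter's Unicode character name
    as a stand-in for its script.
    """
    counts = Counter(
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    )
    return counts.most_common(1)[0][0] if counts else None


# Different languages, same script: script alone cannot tell them apart.
print(dominant_script("नमस्ते दुनिया"))      # Hindi   -> DEVANAGARI
print(dominant_script("नमस्कार जग"))        # Marathi -> DEVANAGARI

# Same language, different scripts: Punjabi in Gurmukhi and in Shahmukhi.
print(dominant_script("ਪੰਜਾਬੀ"))            # GURMUKHI
print(dominant_script("پنجابی"))            # ARABIC
```

Script detection narrows the set of candidate languages, but, as the examples show, it cannot replace language identification itself.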

Detecting multilingualism in a single document

Identifying the language of short texts

CITE THIS AS:

“Automatic Language Identification in Texts”  From NotePub.io – Publish & Share Note! https://notepub.io/notes/artificial-intelligence/speech-and-language-processing/language-identification/lid-introduction/
