Automatic Language Identification in Texts - Introduction

This note series is designed to provide insights into using open-source language identifiers to predict the natural language of a given input text. We will cover the basics of natural language identification, various openly available packages, environment setup, and how to use the tools.

Introduction

In Natural Language Processing (NLP), language identification is the task of determining the natural language in which a text or document is written. Recognizing the language of a text comes naturally to a human reader who is familiar with that language. When NLP tasks are automated, language identification is a preliminary step, because most NLP tasks are language-dependent or script-dependent.

For example, during data cleaning or preprocessing, the list of stop words to remove differs from language to language. Thus, it is always helpful to know the language before applying any NLP task.
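The stop-word point can be illustrated with a minimal sketch. The tiny word lists, the language codes, and the function name `remove_stop_words` are assumptions made for this example only; real stop-word lists are much longer and usually come from an NLP library.

```python
# Tiny illustrative stop-word lists (assumed for this example only).
STOP_WORDS = {
    "en": {"the", "is", "a", "of", "and"},
    "fr": {"le", "la", "est", "un", "de", "et"},
}


def remove_stop_words(text, lang):
    """Drop stop words from text using the list for the given language code."""
    stop_words = STOP_WORDS[lang]
    return [tok for tok in text.lower().split() if tok not in stop_words]


# Correct language code: stop words are removed.
english = remove_stop_words("The cat is a friend of the dog", "en")
french = remove_stop_words("Le chat est un ami de la souris", "fr")

# Wrong language code: the English list removes nothing from the French text.
wrong = remove_stop_words("Le chat est un ami de la souris", "en")

print(english)
print(french)
print(wrong)
```

When the wrong language code is passed, none of the French stop words are filtered out, which is why the language has to be known before this preprocessing step.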

A few other NLP tasks that need language identification are as follows:

Sentiment analysis
Translation
Transliteration

Language Identification Techniques

Language identification techniques have evolved in the following order:

Word-based modelling method
N-gram-based modelling method
RNN and Deep learning method
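As a rough illustration of the N-gram-based method, the sketch below builds a character trigram profile for each language from one sample sentence and picks the language whose profile is most similar to the input. The training sentences, the function names, and the use of cosine similarity are assumptions chosen for this example; practical identifiers train on far larger corpora and may use other scoring schemes.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count character n-grams of a lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


# One toy training sentence per language (assumed data, far too small for real use).
TRAINING = {
    "en": "the quick brown fox jumps over the lazy dog and the cat is in the house",
    "fr": "le renard brun saute par dessus le chien paresseux et le chat est dans la maison",
    "de": "der schnelle braune fuchs springt über den faulen hund und die katze ist im haus",
}
PROFILES = {lang: char_ngrams(sample) for lang, sample in TRAINING.items()}


def identify(text, profiles=PROFILES):
    """Return the language code whose profile is closest to the text."""
    query = char_ngrams(text)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))


print(identify("the dog is in the house"))
print(identify("le chat est dans la maison"))
print(identify("die katze ist im haus"))
```

Because the profiles are built from characters rather than whole words, this kind of model can still score words it has never seen, which is one reason N-gram methods superseded purely word-based ones.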

When a language is low-resource, hybrid techniques are most helpful; these may combine rule-based and N-gram-based modelling methods. However, when large text sources are available for a language, deep learning models are usually the better choice.

Difficulties in Language Identification Process

Distinguishing similar languages

This is a major problem, because the same script is often used to write different languages. For example:

Dari, Kashmiri, etc., are different languages, but all are written in the Perso-Arabic script.
English, French, German, and Italian are different languages, but all are written in the Roman script.
Hindi, Marathi, Nepali, etc., are different languages, but all are written in the Devanagari script.

Conversely, a single language may be written in different scripts:

Konkani is a language spoken along the western coast of India and is written in several different scripts, including Devanagari and Roman.
Punjabi is spoken in both India and Pakistan. However, it is written in the Shahmukhi script in Pakistan and the Gurmukhi script in India.
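The script examples above can be checked with a small sketch that reads the script from Unicode character names using Python's standard `unicodedata` module. The function name `dominant_script` and the sample phrases are assumptions for illustration; taking the first word of a character name is only a rough heuristic for the script, not a full script detector.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most common script label among the letters in text.

    The label is the first word of each letter's Unicode name,
    e.g. 'DEVANAGARI LETTER NA' gives 'DEVANAGARI'.
    """
    scripts = Counter(
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    )
    return scripts.most_common(1)[0][0] if scripts else None


# Same script, different languages: the script alone cannot separate them.
print(dominant_script("नमस्ते दुनिया"))    # Hindi sample
print(dominant_script("नमस्कार जग"))      # Marathi sample
print(dominant_script("hello world"))      # English sample
print(dominant_script("bonjour le monde")) # French sample

# Same language, different scripts: Punjabi in Gurmukhi and in Shahmukhi.
print(dominant_script("ਪੰਜਾਬੀ"))
print(dominant_script("پنجابی"))
```

A script check like this can narrow the set of candidate languages, but it cannot by itself tell Hindi from Marathi, and it would treat the two Punjabi samples as unrelated unless both scripts are mapped to the same language.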

Detection of multilingualism in a single document

Identifying the language of short texts

CITE THIS AS:

“Automatic Language Identification in Texts” From NotePub.io – Publish & Share Note! https://notepub.io/notes/artificial-intelligence/speech-and-language-processing/language-identification/lid-introduction/

NotePub

Indranagar,
Bangalore - 560038, Karnataka, India

Write Us: [email protected]
