From Languages to Information – Word Normalization

Word Normalization

Normalization requires deciding how to represent variants of the same word or concept consistently.

  • Examples of Normalization Decisions:
    • Representing abbreviations (e.g., “USA” vs. “U.S.A.”).
    • Standardizing speech transcriptions (e.g., “uh-huh” vs. “uh huh”).
    • Handling word forms and lemmas (e.g., “am,” “is,” “are” vs. “be”).
    • Case Folding: Converting all letters to lowercase is common in information retrieval to match user queries, but case can be crucial for meaning in other NLP tasks (e.g., “US” vs. “us”).
    • Lemmatization: Reducing words to their base or dictionary form (lemma) by considering morphology (e.g., “am,” “are,” “is” → “be”; “cars” → “car”; Spanish “quiero,” “quieres” → “querer”).
      • Done by morphological parsing, breaking words into morphemes (smallest meaningful units).
      • Stems: Core meaning-bearing units.
      • Affixes: Parts attached to stems with grammatical functions.
      • A morphological parser breaks down words into stems and affixes (e.g., “cats” → “cat” + “s”; Spanish “amaren” → “amar” + morphological features).
    • Stemming: A simpler form of normalization that crudely chops off affixes without necessarily mapping to a true morphological lemma.
      • Advantage: Simplicity.
      • Disadvantage: Can lead to non-words and less accurate grouping (increases recall at the expense of precision).
      • Example of Stemming: “This was not the map we found…” → “Thi wa not the map we found…” (rough output showing suffix removal).
    • Porter Stemmer: A standard stemming algorithm based on a series of rewrite rules applied in sequence (e.g., “-ational” → “-ate,” remove “-ing” if the stem contains a vowel, “SSES” → “SS”).
    • Challenges with Complex Morphology: Languages like Turkish with agglutinative morphology (long words with many morphemes) require richer morphological parsing algorithms beyond simple affix removal.
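The case-folding and lemmatization decisions above can be sketched as a tiny normalization function. This is a minimal illustration, not a real lemmatizer: the lemma table below is a toy stand-in for the dictionary-plus-morphological-parsing approach described above.

```python
# Toy lemma dictionary (a real system derives lemmas via morphological parsing).
LEMMAS = {"am": "be", "is": "be", "are": "be", "cars": "car"}

def normalize(token: str) -> str:
    folded = token.lower()             # case folding: "US" -> "us"
    return LEMMAS.get(folded, folded)  # lemma lookup:  "am" -> "be"

print(normalize("Am"))    # be
print(normalize("Cars"))  # car
```

Note that case folding alone already loses information (“US” the country becomes “us” the pronoun), which is exactly why it suits information retrieval better than meaning-sensitive NLP tasks.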
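The Porter-style rule cascade can be sketched with a few regular-expression rewrites. The three rules below are only illustrative samples of the kinds of rules mentioned above; the real Porter stemmer applies many more rules, in ordered phases, with extra conditions on the stem.

```python
import re

# Illustrative Porter-style rewrite rules, tried in order.
RULES = [
    (r"sses$", "ss"),              # "caresses"   -> "caress"
    (r"ational$", "ate"),          # "relational" -> "relate"
    (r"([aeiou].*)ing$", r"\1"),   # "walking" -> "walk" (only if a vowel precedes -ing)
]

def stem(word: str) -> str:
    for pattern, repl in RULES:
        new = re.sub(pattern, repl, word)
        if new != word:   # apply the first rule that fires
            return new
    return word

print(stem("walking"))  # walk
print(stem("sing"))     # sing  (no vowel before -ing, so the rule does not fire)
```

The “sing” case shows why the rules carry conditions: blindly chopping “-ing” would produce the non-word “s,” the kind of precision loss noted above.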

Sentence Segmentation

Breaking text into larger discourse units, such as sentences.

  • Often achievable using final punctuation marks (!, ?).
  • The period (.) is more ambiguous (sentence boundary vs. abbreviation vs. numbers).
  • Common Algorithm:
    1. Tokenization: First, the text is broken into tokens.
    2. Classification of Periods: A rule-based system or a machine learning classifier determines if a period is part of a word (abbreviation, number) or a sentence boundary. An abbreviation dictionary can be helpful.
    3. Sentence Segmentation: Based on the more accurate tokenization, simple rules are applied to segment sentences.
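The three steps above can be sketched as a small rule-based segmenter. The abbreviation set is a toy stand-in for a real abbreviation dictionary, and whitespace splitting stands in for a proper tokenizer.

```python
# Toy abbreviation dictionary: periods on these tokens are not boundaries.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "U.S.A.", "etc."}

def segment(text: str) -> list[str]:
    tokens = text.split()            # step 1: (crude) tokenization
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok in ABBREVIATIONS:     # step 2: period belongs to the word
            continue
        if tok.endswith((".", "!", "?")):   # step 3: boundary punctuation
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(segment("Dr. Smith arrived. Was she late?"))
# ['Dr. Smith arrived.', 'Was she late?']
```

A machine-learning classifier would replace the dictionary check with learned features (surrounding capitalization, token length, etc.), but the tokenize-then-classify structure is the same.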

Conclusion:

Normalizing words and segmenting sentences are crucial initial steps in text processing for various NLP applications. They involve making informed decisions based on the task and language characteristics.
