Word Normalization
Word normalization requires deciding how to represent variations of the same word or concept consistently.
- Examples of Normalization Decisions:
- Representing abbreviations (e.g., “USA” vs. “U.S.A.”).
- Standardizing speech transcriptions (e.g., “uh-huh” vs. “uh huh”).
- Handling word forms and lemmas (e.g., “am,” “is,” “are” vs. “be”).
- Case Folding: Converting all letters to lowercase is common in information retrieval to match user queries, but case can be crucial for meaning in other NLP tasks (e.g., “US” vs. “us”).
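A minimal sketch of case folding in Python (the helper name `case_fold` is illustrative, not from the notes). `str.casefold()` is a slightly more aggressive variant of `str.lower()` that also normalizes characters like German "ß":

```python
def case_fold(text: str) -> str:
    # casefold() handles cases lower() misses (e.g. "ß" -> "ss"),
    # which matters for matching across languages in retrieval.
    return text.casefold()

# Note the information loss the notes warn about: "US" vs. "us".
print(case_fold("The US economy"))
```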
- Lemmatization: Reducing words to their base or dictionary form (lemma) by considering morphology (e.g., “am,” “are,” “is” → “be”; “cars” → “car”; Spanish “quiero,” “quieres” → “querer”).
- Done by morphological parsing, breaking words into morphemes (smallest meaningful units).
- Stems: Core meaning-bearing units.
- Affixes: Parts attached to stems with grammatical functions.
- A morphological parser breaks down words into stems and affixes (e.g., “cats” → “cat” + “s”; Spanish “amaren” → “amar” + morphological features).
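The simplest (if limited) way to map forms to lemmas is a dictionary lookup; the `LEMMAS` table below is a toy illustration built from the examples above, whereas real lemmatizers rely on full morphological parsing:

```python
# Toy lemma dictionary (illustrative only); a real system would use a
# morphological parser or a lemmatizer such as spaCy's.
LEMMAS = {
    "am": "be", "is": "be", "are": "be",
    "cars": "car",
    "quiero": "querer", "quieres": "querer",  # Spanish forms of "to want"
}

def lemmatize(word: str) -> str:
    # Fall back to the word itself when no lemma is known.
    return LEMMAS.get(word.lower(), word)
```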
- Stemming: A simpler form of normalization that crudely chops off affixes without necessarily mapping to a true morphological lemma.
- Advantage: Simplicity.
- Disadvantage: Can lead to non-words and less accurate grouping (increases recall at the expense of precision).
- Example of Stemming: “This was not the map we found…” → “Thi wa not the map we found…” (rough output showing suffix removal).
- Porter Stemmer: A standard stemming algorithm based on a series of rewrite rules applied in sequence (e.g., “-ational” → “-ate”; remove “-ing” when the remaining stem contains a vowel; “SSES” → “SS”).
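The rule cascade can be sketched as follows. This is an illustrative subset of Porter-style rules, not the full algorithm (which also checks a "measure" condition on the stem); the function and rule names are assumptions:

```python
import re

# A few Porter-style rewrite rules, tried in order.
SUFFIX_RULES = [
    ("ational$", "ate"),  # e.g. "relational" -> "relate"
    ("sses$", "ss"),      # e.g. "caresses"  -> "caress"
]

def stem(word: str) -> str:
    for pattern, repl in SUFFIX_RULES:
        if re.search(pattern, word):
            return re.sub(pattern, repl, word)
    # (*v*) ING -> "": drop "ing" only if the remaining stem contains a
    # vowel, so "motoring" -> "motor" but "sing" stays "sing".
    if word.endswith("ing") and re.search(r"[aeiou]", word[:-3]):
        return word[:-3]
    return word
```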
- Challenges with Complex Morphology: Languages like Turkish with agglutinative morphology (long words with many morphemes) require richer morphological parsing algorithms beyond simple affix removal.
Sentence Segmentation
Breaking text into larger discourse units, such as sentences.
- Often achievable using final punctuation marks (!, ?).
- The period (.) is more ambiguous (sentence boundary vs. abbreviation vs. numbers).
- Common Algorithm:
- Tokenization: First, the text is broken into tokens.
- Classification of Periods: A rule-based system or a machine learning classifier determines if a period is part of a word (abbreviation, number) or a sentence boundary. An abbreviation dictionary can be helpful.
- Sentence Segmentation: Based on the more accurate tokenization, simple rules are applied to segment sentences.
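The algorithm above (tokenize, then classify each period) can be sketched as a simple rule-based segmenter; the `ABBREVS` set and helper names are illustrative assumptions:

```python
import re

# Tiny abbreviation dictionary for illustration; a real system would
# use a much larger list or a trained classifier.
ABBREVS = {"dr.", "mr.", "mrs.", "e.g.", "i.e.", "u.s.a."}

def segment(text: str) -> list[str]:
    tokens = text.split()  # step 1: (crude) tokenization
    sentences, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok.endswith(("!", "?")):  # unambiguous boundaries
            sentences.append(" ".join(current))
            current = []
        elif tok.endswith("."):
            # step 2: classify the period
            if tok.lower() in ABBREVS or re.fullmatch(r"\d+(\.\d+)*\.", tok):
                continue  # abbreviation or number: not a boundary
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```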
Conclusion:
Normalizing words and segmenting sentences are crucial initial steps in text processing for various NLP applications. They involve making informed decisions based on the task and language characteristics.