Notepub (Official)

Note published by Notepub's official team.

From Languages to Information – Word Normalization

Word normalization requires making decisions about how to represent variations of the same word or concept consistently. Examples of normalization decisions include representing abbreviations (e.g., “USA” vs. “U.S.A.”), standardizing speech transcriptions (e.g., “uh-huh” vs. “uh huh”), and handling word forms and lemmas (e.g., “am,” “is,” “are” vs. “be”). Case folding, converting all letters to lowercase, is common …


From Languages to Information – Word Normalization Read More »
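To make these normalization decisions concrete, here is a minimal Python sketch, not taken from the post itself: it applies case folding plus small lookup tables for abbreviations, speech transcriptions, and word forms. The tables and the normalize function are illustrative assumptions, not a complete resource.

```python
# Minimal sketch (an assumption, not code from the post) of three normalization
# decisions: case folding, standardizing speech transcriptions, and mapping
# abbreviations / word forms to a canonical form.
ABBREVIATIONS = {"u.s.a.": "usa", "u.s.": "us"}   # unify abbreviation variants
TRANSCRIPTIONS = {"uh huh": "uh-huh"}             # standardize speech tokens
LEMMAS = {"am": "be", "is": "be", "are": "be"}    # map word forms to a lemma

def normalize(text: str) -> list[str]:
    """Case-fold, apply multi-word replacements, then per-token lookups."""
    t = text.lower()                                   # case folding
    for variant, canonical in TRANSCRIPTIONS.items():
        t = t.replace(variant, canonical)              # "uh huh" -> "uh-huh"
    tokens = []
    for tok in t.split():
        tok = ABBREVIATIONS.get(tok, tok)              # "u.s.a." -> "usa"
        tok = LEMMAS.get(tok, tok)                     # "is" -> "be"
        tokens.append(tok)
    return tokens

print(normalize("Uh huh the U.S.A. is large"))
# -> ['uh-huh', 'the', 'usa', 'be', 'large']
```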

What are the normalization techniques in NLP?

Normalization techniques such as lemmatization and stemming are useful in a wide range of Natural Language Processing (NLP) tasks. Here’s a breakdown of why and where these methods are beneficial. When should you use text normalization? One reason is to reduce data sparsity and improve generalization: inflectional variation means that words often appear in different forms (e.g., “run,” “running,” “ran”). Without …


What are the normalization techniques in NLP? Read More »
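For readers who want to try the two techniques named in this excerpt, here is a short sketch using NLTK as one possible toolkit (an assumption; the post does not prescribe a library). It shows how both methods collapse inflected forms, and how lemmatization handles the irregular form “ran” while the stemmer cannot.

```python
# Stemming vs. lemmatization with NLTK (one possible library, an assumption).
# Requires `pip install nltk` and the WordNet data for the lemmatizer.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # lexical resource used by WordNetLemmatizer

words = ["run", "running", "ran", "studies"]

# Stemming: rule-based suffix stripping; fast, but may yield non-words
# ("studies" -> "studi") and cannot unify the irregular form "ran".
stemmer = PorterStemmer()
print([stemmer.stem(w) for w in words])

# Lemmatization: dictionary lookup with a part-of-speech hint; maps
# "running" and "ran" to the shared lemma "run", reducing data sparsity.
lemmatizer = WordNetLemmatizer()
print([lemmatizer.lemmatize(w, pos="v") for w in words])
```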

Development of an Automatic Speech Recognition (ASR) system

Automatic Speech Recognition (ASR) is a technology that converts spoken language into text. It enables machines to understand and transcribe human speech, powering applications such as voice assistants (e.g., Siri, Alexa), real-time transcription services, and dictation software. ASR systems process audio input, recognize patterns corresponding to words or phrases, and output text. ASR modeling …


Development of an Automatic Speech Recognition (ASR) system Read More »
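As a rough illustration of the ASR modeling step mentioned above, the sketch below implements greedy CTC decoding over placeholder per-frame scores. The vocabulary, the random scores, and the greedy_ctc_decode helper are assumptions made for illustration; production systems compute real acoustic scores from audio features and typically add beam search and a language model.

```python
# Toy sketch of one step in ASR modeling: greedy CTC decoding, which turns
# per-frame label scores from an acoustic model into text. The scores below
# are random placeholders, not real acoustic-model outputs.
import numpy as np

VOCAB = ["<blank>", "a", "b", "c", " "]   # index 0 is the CTC blank symbol

def greedy_ctc_decode(frame_scores: np.ndarray) -> str:
    """Collapse repeated frame labels, then drop blanks (the standard CTC rule)."""
    best = frame_scores.argmax(axis=1)     # most likely label for each frame
    chars, prev = [], None
    for idx in best:
        if idx != prev and idx != 0:       # skip repeats and the blank symbol
            chars.append(VOCAB[idx])
        prev = idx
    return "".join(chars)

rng = np.random.default_rng(0)
fake_scores = rng.random((12, len(VOCAB)))  # 12 frames x vocabulary size
print(greedy_ctc_decode(fake_scores))
```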

CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training

The paper “CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training” presents an automated framework that iteratively improves the data mixture used for pre-training large language models. It embeds and clusters large-scale unlabeled datasets into semantic groups, then evaluates model performance on these clusters to dynamically adjust sampling weights toward more informative or challenging data. The approach uses a smaller proxy model and a predictor to efficiently search the vast space of data mixtures without relying on explicit domain labels. Experiments training a 1-billion-parameter model on a 400-billion-token optimized mixture show a 2% performance gain over the state-of-the-art Llama-3.2-1B, and domain-specific optimization (e.g., Social Sciences) achieves up to a 5% improvement. The paper includes detailed experiments on reasoning benchmarks and introduces two new datasets: ClimbLab, a 1.2-trillion-token corpus clustered into 20 semantic groups, and ClimbMix, a compact 400-billion-token dataset optimized for efficient pre-training. Together, these results demonstrate that CLIMB’s iterative, clustering-based refinement leads to superior model generalization and specialization under fixed compute budgets.


CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training Read More »
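The sketch below caricatures the iterative loop described in this summary: cluster document embeddings into semantic groups, score each cluster with a small proxy model, and shift sampling weights toward higher-scoring clusters. It is not the authors' implementation; the embeddings, the proxy_score placeholder, and the exponential reweighting rule are assumptions for illustration only.

```python
# Highly simplified sketch of a CLIMB-style loop (placeholder assumptions,
# not the paper's actual method or hyperparameters).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(1000, 32))   # stand-in for real document embeddings
n_clusters = 5

cluster_ids = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(doc_embeddings)
print("documents per cluster:", np.bincount(cluster_ids, minlength=n_clusters))

weights = np.full(n_clusters, 1.0 / n_clusters)  # start from a uniform data mixture

def proxy_score(cluster: int) -> float:
    """Placeholder for training a small proxy model on this cluster's data
    and measuring its performance on target benchmarks."""
    return rng.random()

for round_idx in range(3):                       # iterative bootstrapping rounds
    scores = np.array([proxy_score(c) for c in range(n_clusters)])
    weights = weights * np.exp(scores)           # upweight informative clusters
    weights /= weights.sum()                     # keep a valid sampling distribution
    print(f"round {round_idx}: mixture weights = {np.round(weights, 3)}")
```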

What is the difference between knowledge graph-based and deep learning-based reasoning?

Knowledge graph (KG)-based reasoning and deep learning are distinct paradigms for processing and reasoning about data, each with its own mechanisms, strengths, and applications. Since knowledge graphs are closely related to ontologies (ontologies often provide the schema for KGs), KG-based reasoning shares some similarities with ontology-based reasoning but differs from deep learning in specific ways. Below …


What is the difference between knowledge graph-based and deep learning-based reasoning? Read More »
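To make the contrast tangible, here is a toy sketch, not from the post: symbolic reasoning over explicit KG triples versus a TransE-style learned score over randomly initialized (hence meaningless) embeddings. The tiny graph, entity names, and helper functions are illustrative assumptions.

```python
# Toy contrast: explicit symbolic KG reasoning vs. a deep-learning-style
# plausibility score over dense vectors (placeholder embeddings).
import numpy as np

TRIPLES = {
    ("Paris", "located_in", "France"),
    ("France", "located_in", "Europe"),
}

def locations_of(entity: str) -> set[str]:
    """Symbolic KG reasoning: follow 'located_in' edges transitively."""
    found, frontier = set(), {entity}
    while frontier:
        step = {o for (s, p, o) in TRIPLES if s in frontier and p == "located_in"}
        frontier = step - found
        found |= step
    return found

print(locations_of("Paris"))   # {'France', 'Europe'} via an explicit rule

# Deep-learning-style reasoning: score a candidate triple with dense vectors.
rng = np.random.default_rng(0)
emb = {name: rng.normal(size=8) for name in ["Paris", "France", "Europe", "located_in"]}

def plausibility(head: str, relation: str, tail: str) -> float:
    """TransE-style score: smaller ||h + r - t|| means a more plausible fact."""
    return float(-np.linalg.norm(emb[head] + emb[relation] - emb[tail]))

print(plausibility("Paris", "located_in", "France"))   # a soft score, not a hard rule
```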
