Development of an Automatic Speech Recognition (ASR) System

Automatic Speech Recognition (ASR) is a technology that converts spoken language into text. It enables machines to understand and transcribe human speech, powering applications like voice assistants (e.g., Siri, Alexa), real-time transcription services, and dictation software. ASR systems process audio input, recognize patterns corresponding to words or phrases, and output text. […]
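
To make this concrete, here is a minimal sketch of running a pretrained speech-to-text model. It assumes the Hugging Face transformers library with the openai/whisper-tiny checkpoint and an installed audio backend such as ffmpeg; the audio filename is a hypothetical placeholder.

```python
# Minimal ASR sketch using the Hugging Face transformers pipeline.
# Assumes `transformers` and `torch` are installed and ffmpeg is available;
# the audio file path is a hypothetical placeholder.
from transformers import pipeline

# Load a small pretrained speech-to-text model.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-tiny")

# Transcribe an audio file and print the recognized text.
result = asr("meeting_recording.wav")
print(result["text"])
```

In practice, production ASR systems add further components such as voice activity detection, language-model rescoring, and domain-specific fine-tuning.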


CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training

The paper “CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training” presents an automated framework that iteratively improves the data mixture used for pre-training large language models by embedding and clustering large-scale unlabeled datasets into semantic groups, then evaluating model performance on these clusters to dynamically adjust sampling weights toward more informative or challenging data. The approach uses a smaller proxy model and a predictor to efficiently search the vast space of data mixtures without relying on explicit domain labels. Experiments training a 1-billion-parameter model on a 400-billion-token optimized mixture show a 2% performance gain over the state-of-the-art Llama-3.2-1B, with domain-specific optimization (e.g., Social Sciences) achieving up to 5% improvement. The paper includes detailed experimentation on reasoning benchmarks and introduces two new datasets—ClimbLab, a 1.2-trillion-token corpus clustered into 20 semantic groups, and ClimbMix, a compact 400-billion-token dataset optimized for efficient pre-training—demonstrating that CLIMB’s iterative, clustering-based refinement leads to superior model generalization and specialization under fixed compute budgets.
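
The paper's exact procedure is more involved, but the core loop can be sketched as: embed the corpus, cluster it, score each cluster with a proxy model, and shift sampling weight toward high-scoring clusters. In the illustrative sketch below, `proxy_score` and the exponential update rule are hypothetical stand-ins, not the authors' implementation.

```python
# Illustrative CLIMB-style reweighting loop (a sketch, not the paper's code).
import numpy as np
from sklearn.cluster import KMeans

def reweight_clusters(embeddings, weights, proxy_score, temperature=1.0):
    """One iteration: cluster the corpus, score clusters, update weights.

    `proxy_score` is a hypothetical callable that evaluates a small proxy
    model on the documents in a cluster and returns a scalar score.
    """
    labels = KMeans(n_clusters=len(weights), n_init=10).fit_predict(embeddings)
    # Higher score = cluster judged more informative for pre-training.
    scores = np.array([proxy_score(np.where(labels == c)[0])
                       for c in range(len(weights))])
    # Exponential update shifts sampling mass toward informative clusters.
    new_weights = weights * np.exp(scores / temperature)
    return new_weights / new_weights.sum()
```

Iterating this step lets the mixture adapt without explicit domain labels, which is the key idea behind CLIMB's bootstrapping.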


Knowledge Graph vs. Deep Learning-based Reasoning: What's the Difference?

Knowledge graph (KG)-based reasoning and deep learning are distinct paradigms for processing and reasoning about data, each with its own mechanisms, strengths, and applications. Since knowledge graphs are closely related to ontologies (ontologies often provide the schema for KGs), KG-based reasoning shares some similarities with ontology-based reasoning, but it differs from deep learning in fundamental ways. […]
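
To illustrate the contrast, the toy example below shows KG-style symbolic reasoning: facts are explicit triples, and new facts follow deterministically from a hand-written rule, whereas a deep learning model would learn such patterns statistically from data. The facts and rule are invented for illustration.

```python
# Toy symbolic reasoning over knowledge-graph triples (illustrative only).
facts = {
    ("Socrates", "is_a", "human"),
    ("human", "subclass_of", "mortal"),
}

def apply_subclass_rule(triples):
    """Rule: if X is_a A and A subclass_of B, then X is_a B."""
    derived = set(triples)
    for (x, r1, a) in triples:
        for (a2, r2, b) in triples:
            if r1 == "is_a" and r2 == "subclass_of" and a == a2:
                derived.add((x, "is_a", b))
    return derived

print(("Socrates", "is_a", "mortal") in apply_subclass_rule(facts))  # True
```

Every derived fact here is traceable to an explicit rule and explicit triples, which is exactly the transparency that deep learning's learned representations lack.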


Reasoning Capabilities in LLMs

Large Language Models (LLMs) have evolved significantly in their reasoning capabilities, enabling them to tackle complex tasks that require logical deduction, problem-solving, and contextual understanding. Below, I’ll explain the reasoning capabilities of LLMs, provide examples, highlight specific models, and offer a comparison. […]
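
One widely used way to elicit reasoning is chain-of-thought prompting, where the model is asked to work through intermediate steps before answering. The sketch below only constructs such a prompt; the model call and client are omitted, and the question is made up.

```python
# Sketch of a chain-of-thought prompt (model/client details omitted).
question = ("A train departs at 3:15 pm and the journey takes 2 hours "
            "and 50 minutes. What time does it arrive?")

prompt = (
    "Solve the problem. Reason step by step before giving the final answer.\n\n"
    f"Question: {question}\n"
    "Reasoning:"
)
print(prompt)
# A typical model response walks the steps:
# 3:15 pm + 2 h = 5:15 pm; 5:15 pm + 50 min = 6:05 pm.
```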


NLP – Understanding TF-IDF

TF-IDF, or Term Frequency-Inverse Document Frequency, is a crucial measure in NLP and Information Retrieval that assesses a word’s significance in a document relative to a broader corpus. It combines term frequency and inverse document frequency to highlight meaningful terms, aiding in search engines, document clustering, and spam detection, though it has limitations, such as ignoring word order and semantic relationships.
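
As a quick sketch of how this plays out in code, the example below computes TF-IDF weights for a tiny corpus with scikit-learn's TfidfVectorizer; the documents are made up for illustration.

```python
# Compute TF-IDF weights for a small corpus (documents are made up).
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs make good pets",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)  # sparse matrix: documents x terms

# Print each term's weight in the first document. Terms shared across all
# documents (like "the") are down-weighted by the IDF component.
for term, idx in sorted(vectorizer.vocabulary_.items()):
    weight = tfidf[0, idx]
    if weight > 0:
        print(f"{term}: {weight:.3f}")
```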


NLP – Understanding Bag of Words (BoW)

Bag of Words (BoW) is a key Natural Language Processing (NLP) technique that transforms text into numerical vectors by counting word frequencies, disregarding grammar and word order. Though simple, it is popular for numerous applications, including text classification and information retrieval. Its limitations include a lack of context awareness and an inability to capture semantics.
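
A minimal sketch with scikit-learn's CountVectorizer makes the trade-off visible; the two sentences are invented so that they contain the same words in a different order.

```python
# Build a Bag of Words representation (word order is discarded).
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the mat sat on the cat",
]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # learned vocabulary
print(bow.toarray())
# Both sentences yield identical vectors, showing that BoW cannot
# distinguish "cat sat on the mat" from "mat sat on the cat".
```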

