Development of Automatic Speech Recognition (ASR) system

Automatic Speech Recognition (ASR) is a technology that converts spoken language into text. It enables machines to understand and transcribe human speech, powering applications like voice assistants (e.g., Siri, Alexa), real-time transcription services, and dictation software. ASR systems process audio input, recognize patterns corresponding to words or phrases, and output text.

Table of Contents hide

1 ASR Modeling

1.1 Acoustic Model:

1.2 Pronunciation Model (Lexicon):

1.3 Language Model:

1.4 Decoder:

2 ASR Modeling Approaches

3 Design considerations

4 Speech Encoder Architectures

4.1 Architectural Components

4.2 Practical Insights for Developers/Researchers

5 Speech Decoder Architectures

5.1 Architectural Components

5.2 Practical Insights for Developers/Researchers

6 Example Code

7 Key Design Considerations for Developers/Researchers

8 Taxonomy of Frameworks for ASR

9 Datasets for ASR

10 Popular Model Releases for ASR

11 Key Papers for ASR

12 Applications of ASR

ASR Modeling

ASR modeling refers to the process of designing and training models to convert spoken audio into text. It involves multiple interconnected components that handle audio processing, phonetic mapping, and language understanding. The primary components are the acoustic model, pronunciation model, and language model, often combined in a pipeline or end-to-end system.

Acoustic Model:

- Maps raw audio signals to phonetic units (e.g., phonemes, subwords, or characters).
- Input: Audio features like Mel-Frequency Cepstral Coefficients (MFCCs), log-mel filterbanks, or raw waveforms.
- Techniques:
  - Traditional: Gaussian Mixture Models (GMMs) with Hidden Markov Models (HMMs) to model temporal sequences of phonemes.
  - Modern: Deep learning models like Deep Neural Networks (DNNs), Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), or Transformers (e.g., Conformer, Wav2Vec 2.0). These learn directly from audio features.
  - End-to-End: Models like Connectionist Temporal Classification (CTC), RNN-Transducer (RNN-T), or Sequence-to-Sequence (Seq2Seq) with attention map audio to text without explicit phonetic steps.
- Training: Uses large audio-text paired datasets (e.g., LibriSpeech, CommonVoice). Self-supervised pretraining (e.g., Wav2Vec, HuBERT) enhances robustness with unlabeled data.
- Challenges: Noise, accents, and speaker variability require data augmentation and domain adaptation.

Pronunciation Model (Lexicon):

- Maps phonetic units to words, defining how words are pronounced as phoneme sequences.
- Structure: A dictionary (e.g., CMUdict) or Grapheme-to-Phoneme (G2P) model for generating pronunciations dynamically.
- Techniques:
  - Static lexicons for known words.
  - G2P models (e.g., Transformer-based or finite-state transducers) for out-of-vocabulary (OOV) words, proper nouns, or non-standard orthographies.
- Integration: Links acoustic model outputs (phonemes) to language model inputs (words). In end-to-end systems, pronunciation may be implicitly learned.
- Challenges: Handling homophones, accents, and dialectal variations requires multiple pronunciation variants or adaptive G2P.

Language Model:

- Predicts the likelihood of word sequences, constraining the ASR output to fluent and contextually appropriate text.
- Techniques:
  - Traditional: N-gram models estimate word sequence probabilities.
  - Modern: Neural language models like LSTMs, Transformers (e.g., BERT, GPT), or large language models (LLMs) model long-range dependencies.
- Integration: In traditional ASR, combined with acoustic and pronunciation models via a decoding graph (e.g., Weighted Finite-State Transducers). In end-to-end systems, often integrated into the model architecture (e.g., Seq2Seq decoder).
- Challenges: Domain-specific vocabulary and rare words require fine-tuning or external lexicons.

Decoder:

- Combines outputs from the acoustic, pronunciation, and language models to produce the final transcription.
- Techniques:
  - Traditional: Viterbi decoding or beam search over an HMM-based decoding graph.
  - Modern: Beam search in end-to-end models (e.g., CTC, RNN-T) or attention-based decoding in Seq2Seq models.
- Challenges: Balancing speed, accuracy, and computational cost during decoding.

ASR Modeling Approaches

Traditional (Hybrid) Systems:

- Components (acoustic, pronunciation, language models) are trained separately and combined during decoding.
- Pros: Modular, interpretable, works well with limited data.
- Cons: Complex pipeline, requires manual alignment, less flexible for end-to-end optimization.

End-to-End Systems:

- Single neural network (e.g., CTC, RNN-T, Transformer-based) maps audio directly to text, implicitly learning acoustic, pronunciation, and language modeling.
- Pros: Simplified training, better performance with large datasets, adaptable to low-resource languages.
- Cons: Data-hungry, less interpretable, may struggle with rare words without external lexicons.
- Examples: Deep Speech, Whisper, Wav2Vec 2.0.

Self-Supervised Learning:

- Pretrain models on unlabeled audio (e.g., Wav2Vec, HuBERT) to learn general audio representations, then fine-tune on labeled data.
- Pros: Reduces need for labeled data, improves robustness.
- Cons: Requires significant computational resources for pretraining.

Design considerations

Robustness to Noise, Accents, and Low-Resource Languages
- Ensure models handle diverse audio conditions (background noise, varying quality) and speaker variability (accents, dialects).
- Prioritize training on diverse datasets, including low-resource languages, to improve generalization.
- Implement data augmentation (e.g., noise injection, accent synthesis) and transfer learning to address data scarcity.
Computational Efficiency for Real-Time or Edge Deployment
- Optimize models for low latency and minimal resource usage to enable real-time processing on edge devices.
- Use techniques like model pruning, quantization, or lightweight architectures (e.g., distilled models) to reduce computational overhead.
- Balance accuracy with inference speed to meet deployment constraints (e.g., IoT, mobile devices).
Scalability for Multilingual and Large-Scale Datasets
- Design systems to efficiently process large, multilingual datasets with modular architectures or language-agnostic representations.
- Leverage distributed training and data pipelines to handle scale while maintaining performance.
- Incorporate continual learning to adapt to new languages or domains without retraining from scratch.
Trade-offs Between E2E Simplicity and Hybrid System Flexibility
- End-to-End (E2E) Systems: Favor simplicity, unified training, and reduced engineering complexity but may lack adaptability for specialized tasks.
- Hybrid Systems: Offer flexibility to optimize individual components (e.g., acoustic, language models) but require more integration effort.
- Evaluate task requirements: E2E for streamlined deployment, hybrid for fine-grained control or domain-specific performance.

Speech Encoder Architectures

The encoder transforms raw audio (waveforms) or preprocessed features (e.g., log-mel spectrograms, MFCCs) into a high-dimensional latent representation capturing phonetic, temporal, and contextual information.

Input:

Raw audio (e.g., 16kHz waveforms) or features like 80-dimensional log-mel spectrograms.
Preprocessing often includes Short-Time Fourier Transform (STFT) or Mel-filter banks.

Output:

Sequence of embeddings (e.g., $H = [h_1, h_2, \dots, h_T]$

, where $T$

is the number of time steps and $h_t$

is a feature vector).

Architectural Components

Convolutional Neural Networks (CNNs):

Extract local spectral and temporal features from audio inputs (e.g., spectrograms).
Design:
- Multiple 1D/2D convolutional layers with ReLU activations.
- Strided convolutions for downsampling (reducing temporal resolution).
- Example: QuartzNet (NeMo) uses 1D time-channel separable convolutions for efficiency.
Math:
- Input: Spectrogram $X \in R^{F \times T(F = frequency bins, T = time steps).}$
- Output: $H = \text{Conv}(X) \in \mathbb{R}^{C \times T’}$
  
  , where $C$
  
  is the number of channels, $T’ \leq T$
  
  .
Use Case: Lightweight models for edge devices (e.g., QuartzNet, CitriNet).
Pros: Computationally efficient, captures local patterns.
Cons: Limited long-range dependency modeling.

Recurrent Neural Networks (RNNs):

Model temporal dependencies in audio sequences.
Design:
- LSTMs or GRUs to handle vanishing gradients.
- Bidirectional RNNs for better context (e.g., Deep Speech 2).
- Stacked layers (3–5) for deeper modeling.
Math:
- Input: Sequence $X = [x_1, x_2, \dots, x_T]$
  
  .
- Output: $h_t = \text{LSTM}(x_t, h_{t-1})$
  
  , where $h_t$
  
  $h_{t}$ is the hidden state.
Use Case: Early E2E models (e.g., Deep Speech, LAS).
Pros: Strong temporal modeling.
Cons: Sequential processing limits parallelization, high latency.

Transformers:

Capture long-range dependencies using self-attention.
Design:
- Multi-head self-attention (MHSA) to model relationships across time steps.
- Positional encodings to preserve temporal order.
- Stacked layers (6–24) with feed-forward networks (FFNs).
- Example: Whisper uses a Transformer encoder with 4–32 layers depending on model size (Tiny to Large).
Math:
- Self-attention: $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
  
  , where $Q, K, V$
  
  are query, key, value matrices.
- Output: $H = \text{Transformer}(X) \in \mathbb{R}^{T \times d}$
  
  , where $d$
  
  is the embedding dimension.
Use Case: Multilingual models (e.g., Whisper, Granite Speech).
Pros: Parallelizable, excels at long-range dependencies.
Cons: High memory/compute cost, less effective for local patterns.

Conformers:

Combine CNNs and Transformers for local and global context.
Design:
- Conformer block: FFN, MHSA, convolution module, FFN.
- Convolution module captures local features (e.g., 3×3 kernels).
- Example: FastConformer in NeMo optimizes for speed and accuracy.
Math:
- Conformer block: $h’ = h + \text{MHSA}(h) + \text{Conv}(h) + \text{FFN}(h)$
  
  .
- Output: Similar to Transformer but with enhanced local modeling.
Use Case: High-accuracy, low-latency ASR (e.g., ESPnet, NeMo).
Pros: Balances local/global modeling, efficient for speech.
Cons: Complex implementation, higher parameter count.

Self-Supervised Pretraining:

Learn robust representations from unlabeled audio.
Design:
- Contrastive loss (Wav2Vec 2.0): Predict masked audio frames.
- Masked prediction (HuBERT): Predict discrete hidden units.
- Example: Granite Speech 3.3 8B uses self-supervised pretraining for its encoder, enhancing multilingual robustness.
Math:
- Contrastive loss: $\mathcal{L} = -\log \frac{\exp(\text{sim}(z_t, c_t))}{\sum_{c’ \in C} \exp(\text{sim}(z_t, c’))}$
  
  , where $z_t$
  
  is the encoded frame, $c_t$
  
  is the context.
Use Case: Low-resource languages, transfer learning.
Pros: Reduces need for labeled data.
Cons: Requires large unlabeled datasets, compute-intensive.

Practical Insights for Developers/Researchers

Choosing an Encoder:
- Use CNNs for lightweight, edge-based ASR (e.g., QuartzNet for mobile devices).
- Use Transformers for multilingual, high-accuracy tasks (e.g., Whisper, Granite Speech).
- Use Conformers for real-time, high-performance ASR (e.g., NeMo’s FastConformer).
Implementation Tips:
- Apply SpecAugment (time/frequency masking) to augment spectrograms for robustness.
- Use mixed-precision training (e.g., FP16) to reduce memory for Transformer/Conformer encoders.
- Fine-tune self-supervised encoders (e.g., Wav2Vec, Granite Speech) on domain-specific data.
Challenges:
- Handling variable audio lengths (use padding or dynamic batching).
- Ensuring robustness to noise (train on diverse datasets like VoxPopuli).
- Optimizing for low-resource languages (leverage self-supervised pretraining).

Speech Decoder Architectures

The decoder generates text sequences (characters, subwords, or words) from the encoder’s latent representations, aligning audio features with linguistic outputs.

Input:

Encoder outputs (e.g., $H = [h_1, h_2, \dots, h_T]$

$H = [h_{1}, h_{2}, \dots, h_{T}]$ ).
Previous text tokens (in autoregressive decoding).

Output:

Sequence of text tokens (e.g., $Y = [y_1, y_2, \dots, y_N]$

, where $N$

is the output length).

Architectural Components

RNN-based Decoders:

Generate text sequences with attention to align encoder outputs.
Design:
- LSTMs/GRUs for sequential text generation.
- Attention mechanism (e.g., Bahdanau or Luong) to focus on relevant encoder states.
- Example: LAS (Listen, Attend, Spell) uses bidirectional LSTM encoder with an attentional LSTM decoder.
Math:
- Attention: $\alpha_{t,i} = \text{softmax}(\text{score}(s_t, h_i))$
  
  , where $s_t$
  
  $s_{t}$ is the decoder state, $h_i$
  
  is the encoder state.
- Output: $y_t = \text{argmax}(\text{softmax}(W s_t + b))$
  
  .
Use Case: Early seq2seq models (e.g., LAS, RNN-T).
Pros: Strong sequential modeling, effective for short sequences.
Cons: Slow due to autoregressive nature, limited parallelization.

Transformer-based Decoders:

Generate text using self-attention and cross-attention for alignment.
Design:
- Stacked Transformer layers with masked self-attention (for autoregressive generation) and cross-attention to encoder outputs.
- Positional encodings for text sequence order.
- Example: Whisper uses a Transformer decoder (4–32 layers) for multilingual ASR/AST.
Math:
- Masked self-attention: Prevents attending to future tokens.
- Cross-attention: $\text{Attention}(Q_{\text{dec}}, K_{\text{enc}}, V_{\text{enc}})$
  
  .
- Output: $p(y_t | y_{<t}, H) = \text{softmax}(W h_t^{\text{dec}} + b)$
  
  .
Use Case: Multilingual, high-accuracy models (e.g., Whisper, Granite Speech, SpeechT5).
Pros: Parallelizable, handles long sequences well.
Cons: Memory-intensive, prone to hallucinations.

Connectionist Temporal Classification (CTC):

Simplifies decoding by assuming monotonic alignment, bypassing explicit decoder.
Design:
- Linear layer + softmax on encoder outputs to predict character/subword probabilities.
- CTC loss aligns audio frames to text without explicit segmentation.
- Example: Wav2Vec 2.0, Deep Speech use CTC for E2E training.
Math:
- CTC loss: $\mathcal{L}_{\text{CTC}} = -\log \sum_{\pi \in \mathcal{B}^{-1}(y)} p(\pi | X)$
  
  , where $\mathcal{B}$
  
  maps paths $\pi$
  
  to target sequence $y$
  
  .
- Decoding: Beam search or greedy decoding to collapse repeated tokens and remove blanks.
Use Case: Lightweight, low-latency ASR (e.g., Wav2Vec, QuartzNet).
Pros: Fast, no separate decoder training.
Cons: Struggles with non-monotonic alignments, less flexible for translation tasks.

RNN-Transducer (RNN-T):

Combines encoder, prediction network (decoder-like), and joint network for streaming ASR.
Design:
- Prediction network (RNN/Transformer) generates text embeddings based on previous tokens.
- Joint network combines encoder and prediction outputs for token prediction.
- Example: Used in Google Voice, NeMo’s RNN-T models.
Math:
- Prediction network: $g_u = \text{RNN}(y_{u-1}, g_{u-1})$
  
  .
- Joint network: $z_{t,u} = \text{tanh}(W [h_t; g_u] + b)$
  
  $.$
- Output: $p(y_u | h_t, y_{<u}) = \text{softmax}(z_{t,u})$
  
  .
Use Case: Streaming, low-latency ASR.
Pros: Streaming-friendly, balances speed and accuracy.
Cons: Complex training, higher memory than CTC.

Decoder-Only Architectures:

Emerging for streaming ASR, using discrete speech units or text tokens.
Design:
- Trained on discrete audio representations (e.g., HuBERT’s hidden units).
- Autoregressive generation similar to LLMs.
- Example: SpeechGPT, VioLA explore decoder-only ASR.
Math:
- Similar to Transformer decoder but operates on discrete speech units.
- Loss: $\mathcal{L} = -\sum_t \log p(y_t | y_{<t}, X)$
  
  $.$
Use Case: Experimental, streaming ASR.
Pros: Simplified architecture, potential for multi-modal integration.
Cons: Limited maturity, data-hungry.

Practical Insights for Developers/Researchers

Choosing a Decoder:
- Use CTC for lightweight, low-latency ASR (e.g., Wav2Vec for edge devices).
- Use Transformer decoders for multilingual, high-accuracy tasks (e.g., Whisper, Granite Speech for transcription/translation).
- Use RNN-T for streaming applications (e.g., voice assistants).
Implementation Tips:
- Apply label smoothing or scheduled sampling to stabilize Transformer decoder training.
- Use beam search (beam width 5–10) for better accuracy, or greedy decoding for speed.
- For CTC, implement prefix beam search to handle long sequences efficiently.
- Fine-tune decoder layers (e.g., Granite Speech’s LoRA adapters) for domain-specific tasks.
Challenges:
- Decoder hallucinations (e.g., Whisper inserting incorrect text) require post-processing or constrained decoding.
- Streaming decoders (RNN-T, decoder-only) need careful latency optimization.
- Multilingual decoding requires large vocabularies (e.g., SentencePiece for subword tokenization).

Example Code

import torch
import torch.nn as nn

class CNNSpeechEncoder(nn.Module):
    def __init__(self, in_channels=80, out_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU()
        )
        self.linear = nn.Linear(64 * (in_channels // 4), out_dim)

    def forward(self, x):
        # x: [batch, 1, freq, time] (e.g., log-mel spectrogram)
        x = self.conv(x)  # [batch, 64, freq//4, time//4]
        x = x.permute(0, 3, 1, 2).reshape(x.size(0), x.size(3), -1)  # [batch, time//4, 64*(freq//4)]
        x = self.linear(x)  # [batch, time//4, out_dim]
        return x

class TransformerEncoder(nn.Module):
    def __init__(self, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead), num_layers
        )
        self.pos_encoding = nn.Parameter(torch.randn(1, 1000, d_model))

    def forward(self, x):
        # x: [batch, time, d_model]
        x = x + self.pos_encoding[:, :x.size(1), :]
        x = self.transformer(x)  # [batch, time, d_model]
        return x

class CTCDecoder(nn.Module):
    def __init__(self, in_dim=512, vocab_size=100):
        super().__init__()
        self.linear = nn.Linear(in_dim, vocab_size)

    def forward(self, x):
        # x: [batch, time, in_dim]
        logits = self.linear(x)  # [batch, time, vocab_size]
        return logits

class ASREncoderDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder_cnn = CNNSpeechEncoder()
        self.encoder_transformer = TransformerEncoder()
        self.decoder = CTCDecoder()

    def forward(self, x):
        x = self.encoder_cnn(x)
        x = self.encoder_transformer(x)
        logits = self.decoder(x)
        return logits

# Example usage
model = ASREncoderDecoder()
x = torch.randn(2, 1, 80, 400)  # Batch of 2 spectrograms
logits = model(x)  # [2, time//4, vocab_size]

Training:

Use CTC loss: torch.nn.CTCLoss.
Dataset: LibriSpeech (http://www.openslr.org/12/).
Optimizer: Adam with learning rate 1e-4, warmup schedule.

Inference:

Apply beam search or greedy decoding to collapse CTC outputs.
Use libraries like torchaudio for audio preprocessing.

Key Design Considerations for Developers/Researchers

Encoder-Decoder Interaction:
- Attention Mechanisms: Cross-attention in Transformer decoders (e.g., Whisper) aligns encoder outputs with text. Streaming models use monotonic attention (e.g., MoChA).
- Alignment: CTC assumes monotonic alignment, while seq2seq models handle non-monotonic cases (e.g., translation).
- Training Objectives:
  - Encoder: Contrastive or masked prediction for self-supervised learning.
  - Decoder: Cross-entropy (seq2seq), CTC, or transducer loss.
Optimization Techniques:
- **Pruning for low-resource settings (e.g., Wav2Vec fine-tuning).
- Mixed-precision training to reduce memory (e.g., FP16 in PyTorch).
- Gradient clipping to stabilize training.
Scalability:
- Multilingual Models: Use large vocabularies (e.g., BPE with 50k subwords) and diverse datasets (e.g., Common Voice, VoxPopuli).
- Edge Deployment: Optimize for low memory (e.g., QuartzNet’s separable convolutions, Granite Speech’s LoRA).
Evaluation Metrics:
- Word Error Rate (WER): $\text{WER} = \frac{S + I + D}{N}$
  
  , where $S$
  
  = substitutions, $I$
  
  = insertions, $D$
  
  = deletions, $N$
  
  = total words.
- Character Error Rate (CER): Similar to WER but for characters.
- Real-Time Factor (RTF): $\text{RTF} = \frac{\text{Processing Time}}{\text{Audio Duration}}$
  
  $RTF = Audio Duration Processing Time$ .

Taxonomy of Frameworks for ASR

Frameworks are categorized based on their focus, architecture support, and use cases:

End-to-End (E2E) Focused Frameworks:
- Description: Support fully neural, E2E models (e.g., Transformer, Conformer, RNN-T).
- Examples:
  - ESPnet: Multilingual, supports Transformer, Conformer, RNN-T.
    - URL: https://github.com/espnet/espnet
  - SpeechBrain: PyTorch-based, supports Wav2Vec, HuBERT.
    - URL: https://github.com/speechbrain/speechbrain
  - NVIDIA NeMo: Scalable, supports FastConformer, QuartzNet.
    - URL: https://github.com/NVIDIA/NeMo
- Use Case: Research, production for modern E2E models.
Hybrid System Frameworks:
- Description: Support traditional pipeline models (HMM-DNN, lattice decoding).
- Examples:
  - Kaldi: Robust for HMM-DNN, pipeline-based ASR.
    - URL: https://github.com/kaldi-asr/kaldi
- Use Case: Low-resource languages, custom acoustic modeling.
General-Purpose Sequence Modeling Frameworks:
- Description: Support speech alongside other sequence tasks (e.g., NLP, translation).
- Examples:
  - Fairseq: Used for Wav2Vec, S2T.
    - URL: https://github.com/facebookresearch/fairseq
  - Hugging Face Transformers: Hosts Wav2Vec, Whisper, Granite Speech.
    - URL: https://github.com/huggingface/transformers
- Use Case: Rapid prototyping, fine-tuning, multi-modal tasks.
Specialized Production Frameworks:
- Description: Optimized for deployment, real-time, or edge ASR.
- Examples:
  - NVIDIA NeMo: Edge and cloud deployment.
    - URL: https://github.com/NVIDIA/NeMo
  - Hugging Face Transformers: Production-ready APIs.
    - URL: https://github.com/huggingface/transformers
- Use Case: Real-time applications, enterprise solutions.

Datasets for ASR

Key datasets with URLs, categorized by type and use case:

Large-Scale Supervised Datasets:
- LibriSpeech: 960 hours of English audiobooks (clean/noisy).
  - URL: http://www.openslr.org/12/
- GigaSpeech: 10,000 hours of English (audiobooks, podcasts, YouTube).
  - URL: https://github.com/SpeechColab/GigaSpeech
- Use Case: Benchmarking, large-scale training.
Multilingual Datasets:
- Common Voice: Crowdsourced, 100+ languages.
  - URL: https://commonvoice.mozilla.org/
- VoxPopuli: 400k hours unlabeled, 100k hours labeled (European languages).
  - URL: https://github.com/facebookresearch/voxpopuli
- MuST-C: Multilingual (8 languages, TED talks).
  - URL: https://ict.fbk.eu/must-c/
- Use Case: Multilingual ASR, translation.
Conversational Datasets:
- Switchboard: 300 hours of English telephone conversations.
  - URL: https://catalog.ldc.upenn.edu/LDC97S62 (LDC access)
- Fisher: 2,000 hours of English telephone speech.
  - URL: https://catalog.ldc.upenn.edu/LDC2004S13 (LDC access)
- Use Case: Robust conversational ASR.
Unlabeled/Self-Supervised Datasets:
- Libri-Light: 60,000 hours of unlabeled English audio.
  - URL: https://github.com/facebookresearch/libri-light
- Use Case: Self-supervised pretraining.
Domain-Specific Datasets:
- TED-LIUM: 450 hours of TED talk audio.
  - URL: https://www.openslr.org/51/
- MInDS-14: Multilingual for intent detection and ASR.
  - URL: https://huggingface.co/datasets/PolyAI/minds14
- Use Case: Formal speech, task-specific fine-tuning.
Low-Resource Datasets:
- IARPA Babel, OpenSLR: Languages like Swahili, Tamil.
  - URL: http://www.openslr.org/
- Use Case: Low-resource language research.

Popular Model Releases for ASR

Key models with URLs, highlighting Granite Speech:

Granite Speech 3.3 8B (IBM):
- Modular architecture with a speech encoder and LoRA-based audio adapters for efficient fine-tuning.
- Supports ASR and Automatic Speech Translation (AST) in 7+ languages.
- Outperforms Whisper, GPT-4o, Gemini 2.0 Flash in transcription/translation (per IBM benchmarks).
- Open-source under Apache 2.0, enterprise-grade.
- URL: https://huggingface.co/ibm-granite/granite-3.3-8b-speech | https://www.ibm.com/think/topics/granite
- Strength: Multilingual, efficient fine-tuning, robust for enterprise use.
Wav2Vec 2.0 (Facebook AI):
- Self-supervised, CTC-based, pre-trained on LibriVox.
- Variants: Base (96M params), Large (317M params).
- URL: https://huggingface.co/models?filter=wav2vec2
- Strength: Low-resource robustness.
Whisper (OpenAI):
- E2E Transformer for multilingual ASR/translation (680k hours data).
- Variants: Tiny, Base, Small, Medium, Large (up to 1.55B params).
- URL: https://github.com/openai/whisper | https://huggingface.co/models?filter=whisper
- Strength: 99+ languages, noise/accent robustness.
HuBERT (Facebook AI):
- Self-supervised, fine-tuned for ASR (CTC or seq2seq).
- URL: https://github.com/facebookresearch/fairseq/tree/main/examples/hubert | https://huggingface.co/models?filter=hubert
- Strength: Low-resource datasets.
Conformer (Google):
- Combines CNNs and Transformers.
- URL: https://github.com/espnet/espnet | https://github.com/NVIDIA/NeMo
- Strength: High accuracy, low latency.
Deep Speech 2 (Baidu):
- Early E2E model (RNNs, CTC).
- URL: https://github.com/PaddlePaddle/DeepSpeech
- Strength: Simplified E2E architecture.
QuartzNet and CitriNet (NVIDIA):
- Lightweight, separable convolutions.
- URL: https://github.com/NVIDIA/NeMo | https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models
- Strength: Edge, real-time ASR.
RNN-Transducer (RNN-T):
- Streaming ASR with RNNs and joint networks.
- URL: https://github.com/espnet/espnet | https://github.com/NVIDIA/NeMo
- Strength: Low-latency, production-ready.
SpeechT5 (Microsoft):
- Multi-modal for ASR, synthesis, translation.
- URL: https://huggingface.co/microsoft/speecht5_asr
- Strength: Multi-task speech processing.

Key Papers for ASR

Foundational and recent papers with URLs:

Granite Speech 3.3 8B (IBM):
- Title: Not publicly detailed in a specific paper, but referenced in IBM’s Granite documentation.
- URL: https://www.ibm.com/think/topics/granite
- Focus: Modular speech models for ASR/AST.
Wav2Vec 2.0:
- Title: “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations” (Baevski et al., 2020).
- URL: https://arxiv.org/abs/2006.11477
- Focus: Self-supervised speech representation learning.
Whisper:
- Title: “Robust Speech Recognition via Large-Scale Weak Supervision” (Radford et al., 2022).
- URL: https://arxiv.org/abs/2212.04356
- Focus: Multilingual, weakly supervised ASR.
Conformer:
- Title: “Conformer: Convolution-augmented Transformer for Speech Recognition” (Gulati et al., 2020).
- URL: https://arxiv.org/abs/2005.08100
- Focus: Efficient speech modeling with CNNs and Transformers.
HuBERT:
- Title: “HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units” (Hsu et al., 2021).
- URL: https://arxiv.org/abs/2106.07447
- Focus: Self-supervised learning for speech.
Deep Speech 2:
- Title: “Deep Speech 2: End-to-End Speech Recognition in English and Mandarin” (Amodei et al., 2016).
- URL: https://arxiv.org/abs/1512.02595
- Focus: Early E2E ASR.
RNN-Transducer:
- Title: “Sequence Transduction with Recurrent Neural Networks” (Graves, 2012).
- URL: https://arxiv.org/abs/1211.3711
- Focus: Streaming ASR foundations.

Applications of ASR

ASR applications with examples and URLs where applicable:

Voice Assistants:
- Description: Real-time transcription for virtual assistants (e.g., Siri, Google Assistant).
- Models: RNN-T, FastConformer.
- URL: https://cloud.google.com/speech-to-text (Google’s API)
- Use Case: Low-latency, conversational ASR.
Transcription Services:
- Description: Automatic subtitles, meeting notes (e.g., Zoom, Otter.ai).
- Models: Whisper, Granite Speech.
- URL: https://www.otter.ai/
- Use Case: Accurate, multilingual transcription.
Speech Translation:
- Description: Real-time translation of spoken language (e.g., Google Translate, Granite Speech AST).
- Models: Whisper, Granite Speech.
- URL: https://translate.google.com/
- Use Case: Multilingual communication.
Accessibility:
- Description: Speech-to-text for hearing-impaired users (e.g., live captions).
- Models: Whisper, SpeechT5.
- URL: https://www.microsoft.com/en-us/ai/azure-ai-speech
- Use Case: Inclusive technology.
Call Center Analytics:
- Description: Transcribing and analyzing customer calls for insights.
- Models: Granite Speech, Wav2Vec.
- URL: https://www.ibm.com/watson
- Use Case: Enterprise analytics, sentiment analysis.
Medical Transcription:
- Description: Transcribing doctor-patient interactions.
- Models: Whisper, Granite Speech (fine-tuned).
- URL: https://www.nuance.com/healthcare.html
- Use Case: Domain-specific, high-accuracy transcription.
Education:
- Description: Lecture transcription, language learning tools.
- Models: SpeechT5, Common Voice-based models.
- URL: https://www.duolingo.com/ (speech features)
- Use Case: Scalable, multilingual education tools.

Trends and Notes

Self-Supervised Learning: Wav2Vec, HuBERT, and Granite Speech leverage unlabeled audio for low-resource robustness.
Multilingual ASR: Granite Speech (7+ languages), Whisper (99+ languages) address global use cases.
Real-Time ASR: RNN-T, streaming Conformers, and decoder-only models (e.g., VioLA) enable low-latency applications.
On-Device ASR: Lightweight models (e.g., QuartzNet, Whisper Tiny, Granite Speech with LoRA) support edge deployment.
Challenges:
- Robustness to accents, noise, and dialects.
- Low-resource languages require transfer learning.
- Decoder hallucinations (e.g., Whisper inserting incorrect text).
Evaluation Metrics:
- Word Error Rate (WER): (Substitutions + Insertions + Deletions) / Total Words.
- Character Error Rate (CER): For non-word scripts.
- Real-Time Factor (RTF): Inference speed.

Conclusion

Speech encoders and decoders are complementary components of ASR systems, with encoders focusing on robust feature extraction and decoders on accurate text generation. Developers and researchers must balance accuracy, latency, and resource constraints, leveraging modern architectures (e.g., Conformers, Transformers) and self-supervised pretraining (e.g., Granite Speech, Wav2Vec). Tools like ESPnet, NeMo, and Hugging Face Transformers, combined with datasets like LibriSpeech and VoxPopuli, enable rapid prototyping and deployment.

CITE THIS AS:

“Development of Automatic Speech Recognition (ASR) system – 2025” From NotePub.io – Publish & Share Note! https://notepub.io/articles/development-of-automatic-speech-recognition-asr-system/

Development of Automatic Speech Recognition (ASR) system

ASR Modeling

Acoustic Model:

Pronunciation Model (Lexicon):

Language Model:

Decoder:

ASR Modeling Approaches

Design considerations

Speech Encoder Architectures

Architectural Components

Practical Insights for Developers/Researchers

Speech Decoder Architectures

Architectural Components

Practical Insights for Developers/Researchers

Example Code

Key Design Considerations for Developers/Researchers

Taxonomy of Frameworks for ASR

Datasets for ASR

Popular Model Releases for ASR

Key Papers for ASR

Applications of ASR

Trends and Notes

Conclusion

Like this:

NotePub

Indranagar,
Bangalore - 560038, Karnataka, India

Write Us: [email protected]

Essentials

About Us

Contact Us

Private Policy

Copyright Policy

Assets

Notes

Articles

Questions

Projects

ASR Modeling

Acoustic Model:

Pronunciation Model (Lexicon):

Language Model:

Decoder:

ASR Modeling Approaches

Design considerations

Speech Encoder Architectures

Architectural Components

Practical Insights for Developers/Researchers

Speech Decoder Architectures

Architectural Components

Practical Insights for Developers/Researchers

Example Code

Key Design Considerations for Developers/Researchers

Taxonomy of Frameworks for ASR

Datasets for ASR

Popular Model Releases for ASR

Key Papers for ASR

Applications of ASR

Trends and Notes

Conclusion

Share this:

Like this:

NotePub

Indranagar, Bangalore - 560038, Karnataka, India

Write Us: [email protected]

Essentials

About Us

Contact Us

Private Policy

Copyright Policy

Assets

Notes

Articles

Questions

Projects

Indranagar,
Bangalore - 560038, Karnataka, India