All About the Granite Speech 3.3 8B Model for ASR

The Granite Speech 3.3 8B model, released by IBM in April 2025, is an open-source model for speech-to-text (STT) and automatic speech translation (AST), designed for enterprise applications. It excels at automatic speech recognition (ASR) of English and at translating English audio into major languages such as French, Spanish, Italian, German, Portuguese, Japanese, and Mandarin. Below is a detailed overview of the model, its architecture, benchmark datasets, and a comparison with other open-domain models.

Overview of Granite Speech 3.3 8B

The model is built for enterprise use cases requiring robust speech processing, such as transcription, translation, and integration into workflows like customer service or document processing. It is optimized for long audio sequences, reducing artifacts and improving accuracy compared to Whisper-based systems.

Capabilities:

  • ASR: High transcription accuracy for English speech.
  • AST: Translates English audio to multiple languages with improved quality.
  • Modular Design: The speech encoder and projector sit apart from the underlying LLM, limiting how audio inputs can influence the language model and reducing risks from adversarial prompts.

Availability: Released under the Apache 2.0 license, it is accessible on Hugging Face, IBM’s watsonx.ai, and platforms like Ollama, LMStudio, and Replicate.

Limitations:

  • May produce unreliable outputs with greedy decoding (num_beams=1); beam search is recommended (see the usage sketch below).
  • Occasional hallucinations on very short audio inputs (<0.1s), which are under investigation.
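
Both caveats are easy to account for at inference time. The sketch below shows one way to transcribe a file with beam search via the Hugging Face transformers library. Treat it as an outline rather than verified usage: the processor/model classes, the `<|audio|>` prompt marker, and the chat-template call reflect common patterns for this model family, so consult the Hugging Face model card for the canonical code.

```python
import torch
import torchaudio
from transformers import AutoProcessor, AutoModelForSpeechSeq2Seq

# Sketch only: verify class names and prompt format against the
# current Hugging Face model card for ibm-granite/granite-speech-3.3-8b.
model_name = "ibm-granite/granite-speech-3.3-8b"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# Load audio and resample to 16 kHz mono, the expected input format
wav, sr = torchaudio.load("sample.wav")
wav = torchaudio.functional.resample(wav.mean(dim=0, keepdim=True), sr, 16000)

# Chat-style prompt; "<|audio|>" marks where the audio embeddings go
chat = [{"role": "user",
         "content": "<|audio|>can you transcribe the speech into a written format?"}]
prompt = processor.tokenizer.apply_chat_template(
    chat, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, wav, return_tensors="pt").to(model.device)

# num_beams > 1 avoids the unreliable greedy decoding noted above
outputs = model.generate(**inputs, max_new_tokens=200, num_beams=4)
print(processor.tokenizer.decode(
    outputs[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```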

Model Architecture

In Granite Speech 3.3 8B, the conformer-based speech encoder transforms raw audio waveforms into a sequence of embeddings that capture temporal and contextual audio features, which are then passed to downstream components (the speech projector and LLM). The encoder consists of 10 conformer blocks trained with Connectionist Temporal Classification (CTC). Each conformer block is a modular unit with the following components:

  1. Feed-Forward Module (FFM):
    • A pointwise feed-forward layer at the beginning of the block.
    • Applies linear transformations and non-linear activations (e.g., ReLU or Swish) to enhance feature representation.
    • Captures static patterns in the audio signal.
  2. Multi-Head Self-Attention (MHSA):
    • A transformer-style self-attention mechanism that models global dependencies across the audio sequence.
    • Allows the model to focus on relevant parts of the audio, such as phonemes or words, regardless of their temporal distance.
    • Uses multiple attention heads to capture different aspects of the audio context.
  3. Convolution Module (Conv):
    • A depthwise separable convolution layer that captures local temporal patterns in the audio signal.
    • Effective for modeling short-range dependencies, such as the spectral characteristics of speech.
    • Includes a 1D convolution with a kernel size designed to cover short time frames.
  4. Feed-Forward Module (FFM):
    • A second feed-forward layer at the end of the block, mirroring the first FFM.
    • Refines the features before passing them to the next block.
  5. Layer Normalization and Residual Connections:
    • Layer normalization is applied within each module (in the standard conformer design, before each module's transformations, with a final normalization closing the block) to stabilize training.
    • Residual connections are added around each module to improve gradient flow and prevent vanishing gradients.

The conformer block structure can be summarized as:

  • FFM → MHSA → Conv → FFM, with residual connections and normalization.

The 10 conformer blocks are stacked sequentially, allowing the encoder to progressively refine audio features from low-level acoustics to high-level contextual representations.
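
For readers who prefer code, the block structure above can be sketched in PyTorch. This is an illustrative reconstruction of a generic conformer block (following Gulati et al., 2020), not IBM's actual implementation; the model width, kernel size, head count, and half-step FFM residuals are assumptions.

```python
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    """FFM -> MHSA -> Conv -> FFM with residuals, as described above.
    Hyperparameters are illustrative, not Granite's configuration."""

    def __init__(self, d_model=512, n_heads=4, kernel_size=31, ff_mult=4):
        super().__init__()
        # First feed-forward module ("macaron" half-step residual)
        self.ff1 = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, ff_mult * d_model),
            nn.SiLU(),  # Swish activation
            nn.Linear(ff_mult * d_model, d_model),
        )
        # Multi-head self-attention models global dependencies
        self.attn_norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Convolution module: pointwise -> depthwise -> pointwise
        self.conv_norm = nn.LayerNorm(d_model)
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, 2 * d_model, 1),  # pointwise, feeds GLU
            nn.GLU(dim=1),
            nn.Conv1d(d_model, d_model, kernel_size,
                      padding=kernel_size // 2, groups=d_model),  # depthwise
            nn.BatchNorm1d(d_model),
            nn.SiLU(),
            nn.Conv1d(d_model, d_model, 1),  # pointwise
        )
        # Second feed-forward module mirrors the first
        self.ff2 = nn.Sequential(
            nn.LayerNorm(d_model),
            nn.Linear(d_model, ff_mult * d_model),
            nn.SiLU(),
            nn.Linear(ff_mult * d_model, d_model),
        )
        self.final_norm = nn.LayerNorm(d_model)

    def forward(self, x):  # x: (batch, time, d_model)
        x = x + 0.5 * self.ff1(x)  # half-step residual
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        c = self.conv_norm(x).transpose(1, 2)  # (batch, d_model, time)
        x = x + self.conv(c).transpose(1, 2)
        x = x + 0.5 * self.ff2(x)
        return self.final_norm(x)

# Ten blocks stacked sequentially, as in Granite's encoder
encoder = nn.Sequential(*[ConformerBlock() for _ in range(10)])
features = encoder(torch.randn(1, 200, 512))  # e.g., 200 audio frames
```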

Role in Granite Speech 3.3 8B

  1. Input Processing:
    • Takes raw audio waveforms (e.g., 16 kHz mono audio) as input.
    • Applies preprocessing (e.g., spectrogram extraction or feature normalization) to convert waveforms into a format suitable for the encoder.
  2. Feature Extraction:
    • The 10 conformer blocks process the audio to produce a sequence of embeddings.
    • These embeddings capture both local acoustic features (e.g., phoneme transitions) and global contextual information (e.g., sentence-level structure).
  3. CTC Training:
    • The encoder is trained with Connectionist Temporal Classification (CTC), a loss function commonly used in ASR.
    • CTC aligns audio inputs with text outputs without requiring explicit frame-level annotations, enabling the encoder to learn robust speech representations.
    • During training, CTC helps the encoder predict a sequence of tokens (e.g., phonemes or characters) that can be decoded into text; a minimal example follows this list.
  4. Output:
    • The encoder outputs a sequence of embeddings that are passed to the speech projector (a 2-layer query transformer, or Q-former).
    • The projector maps these embeddings into a representation space compatible with the Granite 3.3 8B Instruct LLM for further processing (e.g., transcription or translation).
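
To make the CTC step concrete, here is a minimal PyTorch sketch of computing a CTC loss over encoder outputs. Shapes, vocabulary size, and the blank index are illustrative assumptions, not Granite's training configuration.

```python
import torch
import torch.nn as nn

vocab_size = 256  # illustrative token vocabulary; index 0 is the CTC blank
d_model = 512     # assumed encoder embedding size

# Project per-frame encoder embeddings to token logits
ctc_head = nn.Linear(d_model, vocab_size)
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)

# Stand-in encoder output: (time, batch, d_model), e.g., after 10 blocks
enc_out = torch.randn(200, 2, d_model)
log_probs = ctc_head(enc_out).log_softmax(dim=-1)  # (time, batch, vocab)

# Target token sequences; note no frame-level alignment is provided
targets = torch.randint(1, vocab_size, (2, 40))
input_lengths = torch.full((2,), 200)
target_lengths = torch.full((2,), 40)

# CTC marginalizes over all valid alignments between frames and tokens
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```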

Advantages of the Conformer-Based Speech Encoder

The conformer architecture offers several advantages for speech processing in Granite Speech 3.3 8B:

  • Local and Global Modeling:
    • The convolution module captures fine-grained temporal patterns (e.g., formants, pitch), while the self-attention module models long-range dependencies (e.g., word or sentence context).
  • Robustness:
    • The hybrid CNN-transformer design makes the encoder resilient to noise and variations in audio quality, critical for enterprise use cases.
  • Efficiency:
    • The depthwise separable convolutions reduce computational overhead compared to standard convolutions, enabling faster processing of long audio sequences (see the parameter-count sketch after this list).
  • Scalability:
    • The modular design of conformer blocks allows stacking multiple layers (10 in this case) to balance model capacity and performance.
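
The efficiency claim is easy to check by counting parameters: a depthwise separable convolution factors one standard convolution into a per-channel (depthwise) pass plus a pointwise mixing pass. A quick sketch with arbitrary channel and kernel sizes:

```python
import torch.nn as nn

channels, kernel = 512, 31

# Standard 1D convolution: channels * channels * kernel weights
standard = nn.Conv1d(channels, channels, kernel, padding=kernel // 2)

# Depthwise separable: depthwise (channels * kernel) + pointwise (channels^2)
separable = nn.Sequential(
    nn.Conv1d(channels, channels, kernel, padding=kernel // 2, groups=channels),
    nn.Conv1d(channels, channels, 1),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"standard:  {count(standard):,} params")   # ~8.13M
print(f"separable: {count(separable):,} params")  # ~0.28M, roughly 29x fewer
```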

Training and Datasets

  • Training Objective:
    • The speech encoder is trained using CTC loss on ASR-focused datasets, which include audio-text pairs for transcription tasks.
    • The datasets are a mix of publicly available corpora (e.g., LibriSpeech, Common Voice) and synthetic datasets tailored for enterprise-grade speech tasks (a loading example follows this list).
  • Data Diversity:
    • Covers multiple languages (e.g., English, French, Spanish, Japanese, Mandarin) to support both ASR and AST.
    • Emphasizes governance, risk, and compliance (GRC) criteria, ensuring ethical data usage.
  • Training Infrastructure:
    • Trained on IBM’s Blue Vela cluster with NVIDIA H100 GPUs, completing in 9 days on 32 GPUs (roughly 288 GPU-days).
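
As an illustration of the kind of public ASR corpora in that mix, LibriSpeech can be pulled through the Hugging Face datasets library. This is generic example code, not IBM's training pipeline:

```python
from datasets import load_dataset

# Stream a small slice of LibriSpeech: each row pairs 16 kHz audio
# with its reference transcript, the audio-text format CTC training needs
ds = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)
sample = next(iter(ds))
print(sample["text"])                    # reference transcript
print(sample["audio"]["sampling_rate"])  # 16000
```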

Comparison to Other Speech Encoders

The conformer-based speech encoder in Granite Speech 3.3 8B differs from other approaches used in open-domain models:

  1. Whisper (OpenAI):
    • Uses an almost purely transformer-based encoder, with only a small convolutional front end (two 1D convolution layers) before the transformer stack.
    • Relies on a large transformer stack for both encoding and decoding, which is less efficient for local feature extraction.
    • Granite’s conformer encoder outperforms Whisper in transcription accuracy and artifact reduction, especially for long audio.
  2. Wav2Vec 2.0 (Meta AI):
    • Employs a convolutional encoder followed by a transformer for self-supervised learning.
    • While effective for pretraining, its convolution and attention operate as separate stages rather than being interleaved within each block as in the conformer, limiting its ability to balance local and global modeling.
    • Granite’s encoder is fine-tuned for enterprise tasks, offering better performance in targeted applications.
  3. HuBERT (Meta AI):
    • Similar to Wav2Vec 2.0, uses a convolutional feature extractor followed by transformer layers.
    • Optimized for self-supervised learning, not directly for end-to-end ASR/AST.
    • Granite’s conformer encoder is more tailored for supervised tasks like transcription and translation.

Limitations

  • Short Audio Inputs:
    • The encoder may produce unreliable outputs or hallucinations for very short audio clips (<0.1s), an issue under investigation by IBM.
  • Language Coverage:
    • While effective for major languages, its performance on low-resource languages may lag behind proprietary models like GPT-4o.
  • Decoding Dependency:
    • The encoder’s outputs require beam search decoding (num_beams > 1) for reliable transcription, as greedy decoding (num_beams = 1) can lead to errors.

Conclusion

The conformer-based speech encoder in Granite Speech 3.3 8B is a key component that enables robust speech processing for enterprise applications. With 10 conformer blocks trained via CTC, it effectively captures both local and global audio features, outperforming Whisper-based systems in transcription and translation tasks. Its hybrid CNN-transformer design balances efficiency and performance, making it ideal for long audio sequences and multilingual applications. While it has limitations with short audio inputs and decoding strategies, its integration with the Granite 3.3 8B Instruct LLM and modular architecture positions it as a leading open-source solution for speech-to-text and translation tasks.
