openai/whisper-large-v3-turbo - AI Audio Models Tool

Overview

Whisper-large-v3-turbo is a pruned and fine-tuned variant of OpenAI's Whisper large-v3, designed for much faster automatic speech recognition (ASR) with only a small quality trade-off. The model reduces the autoregressive decoder from 32 layers to 4 while keeping the original encoder capacity, dramatically lowering decoding latency; it remains multilingual (~99 languages) and is distributed under an MIT license on the Hugging Face Hub. ([huggingface.co](https://huggingface.co/openai/whisper-large-v3-turbo/raw/main/README.md))

The release targets real-world transcription workflows that need near-real-time throughput (streaming, batch subtitling, large-volume transcription) while preserving large-v3 quality for many high-resource languages. It integrates directly with Hugging Face Transformers (via both the pipeline and the model + processor APIs) and supports sentence- and word-level timestamps, language forcing, temperature-fallback decoding, and speedups via torch.compile and Flash Attention / scaled dot-product attention where available. The checkpoint is provided in safetensors and is sized to be GPU-friendly for inference and quantization. ([huggingface.co](https://huggingface.co/openai/whisper-large-v3-turbo/raw/main/README.md))

Model Statistics

  • Downloads: 4,015,573
  • Likes: 2849
  • Pipeline: automatic-speech-recognition

License: mit

Model Details

Architecture and sizing: whisper-large-v3-turbo is an encoder–decoder (sequence-to-sequence) Transformer derived from OpenAI's Whisper large family. The turbo variant keeps a deep encoder but prunes the decoder from 32 layers down to 4, yielding an approximate parameter count of 809 million and a safetensors checkpoint around 1.62 GB. This design prioritizes parallelizable encoder work and a shallow but wide decoder to accelerate autoregressive decoding. ([huggingface.co](https://huggingface.co/openai/whisper-large-v3-turbo/raw/main/README.md))

Capabilities and behavior: the model performs multilingual speech recognition across ~99 languages, automatic language identification, and sentence- and word-level timestamps, and supports both transcription and (limited) translation tasks. Note that turbo was fine-tuned on transcription-only data and may show reduced translation quality versus the original large-v3. Decoding features supported in the Transformers integration include temperature fallback, a condition_on_prev_tokens toggle, logprob and no-speech thresholds, and explicit language/task arguments. Performance can be further improved using torch.compile, Flash Attention 2, or PyTorch SDPA (scaled dot-product attention) where applicable. ([huggingface.co](https://huggingface.co/openai/whisper-large-v3-turbo/raw/main/README.md))

Deployment and integration: the model is hosted on Hugging Face and loads through the AutoModelForSpeechSeq2Seq + AutoProcessor APIs (Transformers). Long-form audio is handled via sequential (sliding-window) or chunked algorithms; chunking favors latency while sequential favors marginally higher accuracy. The weights are available in safetensors for safe loading and quantization. ([huggingface.co](https://huggingface.co/openai/whisper-large-v3-turbo/raw/main/README.md))
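The decoding controls mentioned above can be sketched as a `generate_kwargs` dictionary passed to the Transformers ASR pipeline. This is a minimal illustration: the parameter names follow the Transformers Whisper generation API, and the threshold values shown are example settings, not tuned recommendations.

```python
# Decoding controls for the Transformers Whisper integration.
# With a tuple of temperatures, decoding retries at progressively higher
# temperatures whenever the log-probability or compression-ratio checks fail.
generate_kwargs = {
    "language": "english",            # force the language instead of auto-detecting
    "task": "transcribe",             # or "translate" (reduced quality on turbo)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # temperature-fallback schedule
    "logprob_threshold": -1.0,        # retry if mean token logprob falls below this
    "no_speech_threshold": 0.6,       # treat a segment as silence above this probability
    "condition_on_prev_tokens": False,  # don't condition on the previous segment's text
}

# Assuming a `pipe` object built as in the Example Usage section below:
# result = pipe("example_audio.mp3", generate_kwargs=generate_kwargs)
```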

Key Features

  • Pruned decoder: 4-layer decoder (vs 32) for substantially faster autoregressive decoding.
  • Multilingual ASR: supports ~99 languages with automatic language detection.
  • Transformers-ready: direct AutoModelForSpeechSeq2Seq + AutoProcessor integration.
  • Timestamps: supports sentence- and word-level timestamps for subtitle generation.
  • Performance opt-in: compatible with torch.compile, Flash Attention 2, and SDPA speedups.
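The timestamp feature maps naturally onto subtitle generation. As a sketch, the helper below converts timestamp chunks into SRT cues; the chunk shape `{"timestamp": (start, end), "text": ...}` matches what the Transformers ASR pipeline returns with `return_timestamps=True`, and the sample data is illustrative.

```python
def to_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def chunks_to_srt(chunks) -> str:
    """Render pipeline timestamp chunks as numbered SRT subtitle cues."""
    cues = []
    for i, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        cues.append(
            f"{i}\n{to_srt_time(start)} --> {to_srt_time(end)}\n{chunk['text'].strip()}"
        )
    return "\n\n".join(cues)

# Illustrative chunks; in practice these come from
# pipe("audio.mp3", return_timestamps=True)["chunks"]
sample = [
    {"timestamp": (0.0, 2.5), "text": " Hello there."},
    {"timestamp": (2.5, 5.0), "text": " Welcome to the demo."},
]
print(chunks_to_srt(sample))
```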

Example Usage

Example (python):

import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Choose device and dtype
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-large-v3-turbo"

# Load model + processor (safetensors compatible)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Create a pipeline for transcription
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    chunk_length_s=30,   # chunked long-form decoding: lower latency than sequential
    device=device,
)

# Transcribe a file
result = pipe("example_audio.mp3", return_timestamps=True)
print(result["text"])
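The torch.compile and attention speedups mentioned in Key Features are opted into at load time. A minimal sketch, assuming a CUDA device (and, for the Flash Attention 2 path, that the flash-attn package is installed):

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq

model_id = "openai/whisper-large-v3-turbo"

# PyTorch scaled dot-product attention; pass "flash_attention_2" instead
# if flash-attn is installed and the GPU supports it.
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="sdpa",
).to("cuda:0")

# Optional: static KV cache + compiled forward pass. The first few calls are
# slower while kernels are traced; subsequent calls run faster.
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)
```

These flags change only speed and memory behavior, not transcription quality, so they can be enabled incrementally when benchmarking a deployment.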

Benchmarks

  • Decoder layers: reduced from 32 to 4 (pruned and fine-tuned). (Source: https://huggingface.co/openai/whisper-large-v3-turbo model README; GitHub discussion for release details.)
  • Parameter count: ≈809M parameters. (Source: https://huggingface.co/openai/whisper-large-v3-turbo model README.)
  • Checkpoint size: ~1.62 GB (safetensors). (Source: https://huggingface.co/openai/whisper-large-v3-turbo files listing.)
  • Community benchmark (real-time factor and WER): RTF ~0.0203, WER ≈0.2012 on a 5-hour test comparing large-v3-turbo with other Whisper variants. (Source: community benchmark repository, ChocolateMagnate speech-to-text-benchmarks on GitHub.)
  • Vendor claim (inference speed): reported up to 216× real time on GroqCloud; a hardware-specific claim. (Source: Groq blog announcement of whisper-large-v3-turbo support.)
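To read these numbers consistently: real-time factor is processing time divided by audio duration, so lower is faster, and an "N× real-time" speed claim corresponds to RTF = 1/N. A quick sketch of the arithmetic:

```python
def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: processing time / audio duration (lower is faster)."""
    return processing_seconds / audio_seconds

# The community figure above: RTF ~0.0203 means roughly 1/0.0203 ≈ 49x
# faster than real time.
print(round(1 / 0.0203))

# Conversely, a "216x real-time" claim corresponds to an RTF of about:
print(round(1 / 216, 5))
```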

Last Refreshed: 2026-03-03

Key Information

  • Category: Audio Models
  • Type: AI Audio Models Tool