openai/whisper-large-v3-turbo - AI Audio Models Tool
Overview
Whisper-large-v3-turbo is a pruned and fine-tuned variant of OpenAI's Whisper large-v3, designed for much faster automatic speech recognition (ASR) at only a small quality trade-off. The model reduces the autoregressive decoder from 32 layers to 4 while keeping the original encoder capacity, dramatically lowering decoding latency; it remains multilingual (≈99 languages) and is distributed under the MIT license on the Hugging Face Hub. ([huggingface.co](https://huggingface.co/openai/whisper-large-v3-turbo/raw/main/README.md))
The release targets real-world transcription workflows that need near-real-time throughput (streaming, batch subtitling, large-volume transcription) while preserving large-v3 quality for many high-resource languages. It integrates directly with Hugging Face Transformers (both the pipeline and the model + processor APIs) and supports sentence- and word-level timestamps, language forcing, and temperature-fallback decoding strategies, with further speedups available via torch.compile and Flash Attention 2 / scaled dot-product attention where supported. The checkpoint is provided in safetensors format and is sized to be GPU-friendly for inference and quantization. ([huggingface.co](https://huggingface.co/openai/whisper-large-v3-turbo/raw/main/README.md))
Model Statistics
- Downloads: 4,015,573
- Likes: 2,849
- Pipeline: automatic-speech-recognition
- License: mit
Model Details
Architecture and sizing: whisper-large-v3-turbo is an encoder–decoder (sequence-to-sequence) Transformer derived from OpenAI's Whisper large family. The turbo variant keeps the deep encoder but prunes the decoder from 32 layers down to 4, yielding roughly 809 million parameters and a safetensors checkpoint of about 1.62 GB. This design concentrates compute in the parallelizable encoder and uses a shallow decoder to accelerate autoregressive decoding. ([huggingface.co](https://huggingface.co/openai/whisper-large-v3-turbo/raw/main/README.md))
Capabilities and behavior: the model performs multilingual speech recognition across ~99 languages, with automatic language identification and sentence- and word-level timestamps. It supports both transcription and (limited) translation; note that turbo was fine-tuned on transcription-only data and may show reduced translation quality versus the original large-v3. Decoding features supported in the Transformers integration include temperature fallback, the condition_on_prev_tokens toggle, log-probability and no-speech thresholds, and explicit language/task arguments. Performance can be further improved with torch.compile, Flash Attention 2, or PyTorch SDPA (scaled dot-product attention) where applicable. ([huggingface.co](https://huggingface.co/openai/whisper-large-v3-turbo/raw/main/README.md))
Deployment and integration: the model is hosted on Hugging Face and loads through the AutoModelForSpeechSeq2Seq + AutoProcessor APIs in Transformers. Long-form audio is handled via either the sequential (sliding-window) or the chunked algorithm; chunking favors latency, while the sequential algorithm yields marginally higher accuracy. The weights are available in safetensors for safe loading and quantization. ([huggingface.co](https://huggingface.co/openai/whisper-large-v3-turbo/raw/main/README.md))
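The decoding controls mentioned above are exposed through generate_kwargs in the Transformers integration. A minimal sketch of one possible configuration (the values here are illustrative choices, not the model's defaults):

```python
# Illustrative decoding controls for Whisper generation in Transformers.
# Exact defaults may vary by library version.
generate_kwargs = {
    "language": "english",          # force a language instead of auto-detection
    "task": "transcribe",           # or "translate" (reduced quality on turbo)
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # fallback schedule
    "logprob_threshold": -1.0,      # retry at higher temperature below this
    "no_speech_threshold": 0.6,     # treat a segment as silence above this
    "condition_on_prev_tokens": False,  # disable context carry-over
}

# Usage sketch (requires a loaded pipeline `pipe`, as in Example Usage):
# result = pipe("audio.mp3", generate_kwargs=generate_kwargs, return_timestamps=True)
```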
Key Features
- Pruned decoder: 4-layer decoder (vs 32) for substantially faster autoregressive decoding.
- Multilingual ASR: supports ~99 languages with automatic language detection.
- Transformers-ready: direct AutoModelForSpeechSeq2Seq + AutoProcessor integration.
- Timestamps: supports sentence- and word-level timestamps for subtitle generation.
- Performance opt-in: compatible with torch.compile, Flash Attention 2, and SDPA speedups.
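On the last point: the attention backend is selected at load time via the attn_implementation argument to from_pretrained. The helper below is a hypothetical convenience (not part of the library) that probes what the current environment supports:

```python
def pick_attn_implementation() -> str:
    """Return an attn_implementation value for from_pretrained():
    "flash_attention_2" if the flash-attn package is installed,
    "sdpa" if PyTorch exposes scaled_dot_product_attention,
    "eager" (the plain implementation) otherwise."""
    try:
        import flash_attn  # noqa: F401  (also requires a supported GPU at runtime)
        return "flash_attention_2"
    except ImportError:
        pass
    try:
        import torch
        if hasattr(torch.nn.functional, "scaled_dot_product_attention"):
            return "sdpa"
    except ImportError:
        pass
    return "eager"

# Usage sketch (not run here; downloads the checkpoint):
# model = AutoModelForSpeechSeq2Seq.from_pretrained(
#     "openai/whisper-large-v3-turbo",
#     torch_dtype=torch.float16,
#     attn_implementation=pick_attn_implementation(),
# )
```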
Example Usage
Example (python):
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
# Choose device and dtype
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "openai/whisper-large-v3-turbo"
# Load model + processor (safetensors compatible)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,
    use_safetensors=True,
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
# Create a pipeline for transcription
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    chunk_length_s=30,  # use chunked long-form decoding for lower latency
    device=0 if torch.cuda.is_available() else -1,
)
# Transcribe a file
result = pipe("example_audio.mp3", return_timestamps=True)
print(result["text"])
Benchmarks
- Decoder layers: reduced from 32 to 4 (pruned and fine-tuned). (Source: https://huggingface.co/openai/whisper-large-v3-turbo model README; GitHub discussion for release details.)
- Parameter count: ≈809M parameters. (Source: https://huggingface.co/openai/whisper-large-v3-turbo model README.)
- Checkpoint size: ~1.62 GB (safetensors). (Source: https://huggingface.co/openai/whisper-large-v3-turbo files listing.)
- Community benchmark (example): real-time factor (RTF) ~0.0203 and WER ≈0.2012 on a 5-hour test set (large-v3-turbo vs. other Whisper variants). (Source: community benchmark repository (ChocolateMagnate), GitHub speech-to-text-benchmarks.)
- Vendor claim (inference speed): reported up to 216x real time on GroqCloud (hardware-specific claim). (Source: Groq blog announcement of whisper-large-v3-turbo support.)
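For interpreting these figures: real-time factor is processing time divided by audio duration (lower is better), and its reciprocal is the "x real time" speed that vendor claims quote. A quick sanity check of the community number above:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent transcribing / duration of the audio (lower is better)."""
    return processing_seconds / audio_seconds

# An RTF of ~0.0203 on 5 hours (18,000 s) of audio implies roughly
# 18,000 * 0.0203 = 365.4 s of compute, i.e. about 49x real time.
# That is a different, hardware-dependent figure from the 216x vendor claim.
audio_s = 5 * 3600
compute_s = audio_s * 0.0203
speedup = 1.0 / 0.0203  # the "x real time" equivalent, about 49x
```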
Key Information
- Category: Audio Models
- Type: AI Audio Models Tool