Whisper Large v3 - AI Audio Models Tool
Overview
Whisper Large v3 is OpenAI’s largest open-source automatic speech recognition (ASR) and speech-translation checkpoint, trained on a mixture of weakly labeled and pseudo-labeled audio totalling more than five million hours. The model is multilingual (reported support for ~99 languages) and is built to generalize in zero‑shot settings across accents, background noise, and diverse recording conditions. According to the model card and OpenAI release notes, large‑v3 keeps the same encoder–decoder Transformer architecture as prior large checkpoints but adopts higher-resolution input features and expanded training data to reduce errors across many languages. ([huggingface.co](https://huggingface.co/openai/whisper-large-v3))

Whisper large‑v3 is distributed on the Hugging Face Hub under an open license and is widely used both as a direct ASR/translation model and as a base for distilled/faster variants (for example large‑v3‑turbo and community distillations). The model supports sentence‑ and word‑level timestamps, automatic language detection, and translation to English (task="translate") in the same model. Community and research benchmarks report consistent accuracy gains versus large‑v2, while various turbo/distilled builds trade a small accuracy drop for large speed and memory improvements. ([huggingface.co](https://huggingface.co/openai/whisper-large-v3))
Model Statistics
- Downloads: 6,066,762
- Likes: 5,437
- Pipeline: automatic-speech-recognition
- License: apache-2.0
Model Details
Architecture & inputs: Whisper large‑v3 uses the same encoder–decoder Transformer design as previous large checkpoints. The input spectrogram uses 128 Mel frequency bins (up from 80), and the model natively processes a ~30‑second audio receptive field; longer audio is handled via sequential or chunked long‑form algorithms. The checkpoint performs both transcription (same-language output) and translation (non‑English → English) via special task tokens. ([huggingface.co](https://huggingface.co/openai/whisper-large-v3))

Training & scale: OpenAI reports the v3 checkpoint was trained on a mixture composed of ~1 million hours of weakly labeled audio plus ~4 million hours of pseudo‑labeled audio (pseudo-labels generated from large‑v2), trained for ~2 epochs over that mixture. Community sources and hub pages list the large‑v3 parameter footprint at ~1.55 billion parameters (the same scale as previous ‘large’ checkpoints), and the model card on Hugging Face lists the license as Apache‑2.0. For long‑form transcription, Hugging Face recommends the sequential sliding‑window algorithm for best accuracy and a chunked algorithm when speed is primary. ([huggingface.co](https://huggingface.co/openai/whisper-large-v3))

Inference & deployment notes: Transformers pipeline support is available (AutoModelForSpeechSeq2Seq + AutoProcessor) with fp16 and low‑CPU‑memory usage options. Performance can be improved with torch.compile (reported ~4.5x speedups on compatible GPUs) and FlashAttention2 where supported. Many community builds provide quantized or pruned/turbo variants (large‑v3‑turbo) for lower VRAM and faster latency. Validate per-language accuracy before production use, and prefer large‑v3 for highest accuracy when compute permits. ([huggingface.co](https://huggingface.co/openai/whisper-large-v3))
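To illustrate the 128‑bin Mel input and fixed 30‑second receptive field described above, the sketch below runs Whisper's feature extractor on a short clip of silence. It constructs a `WhisperFeatureExtractor` locally with `feature_size=128` (rather than downloading the large‑v3 processor) to keep the example self‑contained, which is an assumption made purely for illustration; in practice you would load it via `AutoProcessor.from_pretrained("openai/whisper-large-v3")`.

```python
import numpy as np
from transformers import WhisperFeatureExtractor

# Configure the extractor the way large-v3 does: 128 Mel bins
# (earlier checkpoints used 80) over a 30-second window.
feature_extractor = WhisperFeatureExtractor(feature_size=128)

# Five seconds of silence at Whisper's expected 16 kHz sampling rate.
audio = np.zeros(16_000 * 5, dtype=np.float32)

features = feature_extractor(
    audio, sampling_rate=16_000, return_tensors="np"
).input_features

# The spectrogram is padded to the full 30 s receptive field:
# 30 s * 16,000 Hz / 160-sample hop = 3,000 frames.
print(features.shape)  # (1, 128, 3000)
```

Audio shorter than 30 seconds is zero‑padded and longer audio is truncated, which is why the long‑form algorithms mentioned above are needed for recordings beyond the native window.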
Key Features
- Multilingual ASR and speech→English translation in one encoder–decoder model.
- Trained on >5M hours (1M weak + 4M pseudo‑labels) for strong zero‑shot generalization.
- 128‑band Mel‑spectrogram input improves frequency resolution versus earlier checkpoints.
- Built‑in language detection, sentence and optional word‑level timestamps.
- Open checkpoints on Hugging Face; community turbo/distilled variants for lower latency.
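When the transformers pipeline is called with `return_timestamps=True`, it returns sentence‑level "chunks", each carrying a `(start, end)` timestamp tuple and its text. As a minimal sketch of how those timestamps can be consumed downstream, the hypothetical helper below converts such chunks into SRT subtitle cues; the sample chunks are made up for illustration, not real model output.

```python
def to_srt(chunks):
    """Convert Whisper pipeline timestamp chunks into an SRT-formatted string."""
    def fmt(seconds):
        # SRT timestamps use the form HH:MM:SS,mmm
        ms = int(round(seconds * 1000))
        h, rem = divmod(ms, 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    cues = []
    for i, chunk in enumerate(chunks, start=1):
        start, end = chunk["timestamp"]
        cues.append(f"{i}\n{fmt(start)} --> {fmt(end)}\n{chunk['text'].strip()}\n")
    return "\n".join(cues)

# Made-up chunks in the shape the pipeline emits with return_timestamps=True.
sample = [
    {"timestamp": (0.0, 3.2), "text": " Hello everyone."},
    {"timestamp": (3.2, 7.5), "text": " Welcome to the meeting."},
]
print(to_srt(sample))
```

Passing `return_timestamps="word"` instead yields word‑level chunks in the same shape, so the same post‑processing applies.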
Example Usage
Example (python):
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
# Example: load Whisper large-v3 from Hugging Face and transcribe a file
model_id = "openai/whisper-large-v3"
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)
# Transcribe a local file (automatically detects language and returns text)
result = pipe("./meeting_audio.mp3")
print(result["text"]) # prints transcription
# Notes: see the model card for options like return_timestamps, task='translate', and chunk_length_s.
# (Example adapted from the Hugging Face model card.)
Benchmarks
Reported error reduction vs. large‑v2 (aggregate): 10–20% relative reduction in error on Common Voice / Fleurs subsets (Source: https://github.com/openai/whisper/discussions/1762)
Training data size (reported): >5 million hours total (1M weak + 4M pseudo‑labeled) (Source: https://huggingface.co/openai/whisper-large-v3)
Example study WER (selected experiment): Mean WER 9.3% (large‑v3) vs 15.3% (base) — study-specific (accent/accessibility study) (Source: https://nhsjs.com/2025/evaluating-the-accessibility-of-automatic-speech-recognition-technology-across-accents/)
Distilled / latency comparison (community distillation): Distil variants report ~6x faster inference with ≈1% WER gap to large‑v3 on long‑form tasks (Source: https://huggingface.co/distil-whisper/distil-large-v3)
Turbo variant speedup (OpenAI/community reports): large‑v3‑turbo reported ≈8× faster than full large‑v3 with small accuracy tradeoffs (Source: https://github.com/openai/whisper/discussions/2363)
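The figures above mix absolute WER with relative error reduction, two quantities that are easy to conflate. As a worked example (reusing the accent-study numbers purely for arithmetic illustration), relative reduction is computed as (old − new) / old:

```python
def relative_error_reduction(old_wer, new_wer):
    """Percentage relative reduction in word error rate."""
    return (old_wer - new_wer) / old_wer * 100

# From the accent study above: base model 15.3% WER vs large-v3 9.3% WER.
print(round(relative_error_reduction(15.3, 9.3), 1))  # 39.2
```

So a 6‑point absolute WER drop from 15.3% corresponds to roughly a 39% relative reduction, which is why relative figures (like the 10–20% quoted for large‑v3 vs. large‑v2) can look larger than the underlying absolute change.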
Key Information
- Category: Audio Models
- Type: AI Audio Models Tool