openai/whisper-large-v3-turbo - AI Audio Models Tool

Overview

openai/whisper-large-v3-turbo is a pruned, fine-tuned variant of OpenAI's Whisper large-v3 designed for fast, high-quality automatic speech recognition (ASR) and speech translation. By reducing the decoder from 32 layers to 4, the model achieves much faster inference with only a modest drop in transcription quality compared to the full large-v3. It is distributed under the MIT license and integrates with Hugging Face Transformers for straightforward use in transcription and translation pipelines. The turbo variant supports 99 languages and targets real-world use cases where latency and compute cost matter, such as interactive transcription, batch processing of long audio, and embedded or edge inference. The model page lists it as ready for the automatic-speech-recognition pipeline and shows substantial community adoption (downloads and likes). For teams wanting a balance of speed and quality, whisper-large-v3-turbo is a practical alternative to the full-sized model.

Model Statistics

  • Downloads: 2,649,818
  • Likes: 2,765
  • Pipeline: automatic-speech-recognition
  • Parameters: 808.9M

License: MIT

Model Details

Architecture and provenance: whisper-large-v3-turbo is derived from OpenAI's Whisper large-v3, an encoder-decoder Transformer. The model retains the original Whisper encoder design but is pruned and fine-tuned, with the decoder depth reduced from 32 layers to 4 to accelerate generation.

Size and compatibility: The model has roughly 808.9 million parameters and is released under the MIT license. It is published on Hugging Face and compatible with the Transformers automatic-speech-recognition pipeline, making it easy to drop into existing ASR workflows.

Capabilities: The model performs direct speech-to-text transcription and speech-to-text translation across the 99 languages supported by the Whisper family. The turbo pruning trades a small amount of accuracy for large reductions in inference time and compute cost, making it suitable for latency-sensitive deployments.
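A rough back-of-envelope check that the reported parameter count is consistent with the pruning, assuming (these dimensions are not stated on the model page) d_model = 1280, an FFN width of 4 x d_model, and ~1.55B parameters for the full large-v3; biases and layer norms are ignored:

```python
# Estimate the parameters removed by pruning the decoder from 32 to 4 layers.
# Assumptions (hypothetical, not from the model page): d_model = 1280,
# FFN width = 4 * d_model, full large-v3 ~1.55B parameters.
d_model = 1280
ffn = 4 * d_model

self_attn = 4 * d_model**2          # q, k, v, and output projections
cross_attn = 4 * d_model**2         # same shape as self-attention
feed_forward = 2 * d_model * ffn    # two linear layers

per_decoder_layer = self_attn + cross_attn + feed_forward
removed = 28 * per_decoder_layer    # 32 - 4 = 28 layers pruned

estimate = 1_550_000_000 - removed
print(f"~{per_decoder_layer / 1e6:.1f}M params per decoder layer")
print(f"turbo estimate: ~{estimate / 1e6:.0f}M (model page reports 808.9M)")
```

The estimate (~816M) lands within about 1% of the reported 808.9M, which supports the "pruned decoder, unchanged encoder" description.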

Key Features

  • Pruned decoder (32→4 layers) for much faster inference and lower latency
  • Automatic speech recognition across 99 languages
  • Speech-to-text translation ability (speech translation to target language)
  • Compatible with Hugging Face Transformers pipeline for plug-and-play use
  • Distributed under the permissive MIT license for broad reuse

Example Usage

Example (python):

from transformers import pipeline

# Basic transcription example using the Hugging Face pipeline
asr = pipeline(
    task="automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
    device=0  # set to -1 for CPU
)

# Transcribe a local file (WAV/MP3/FLAC supported)
result = asr("path/to/audio_file.wav")
print("Transcribed text:", result["text"])

# Example: translate speech to English (models in the Whisper family can be used for translation)
# Some deployments require passing translation generation kwargs; behavior may vary by Transformers version.
translated = asr("path/to/foreign_language_audio.wav", generate_kwargs={"task": "translate"})
print("Translated text:", translated["text"])

# For advanced usage (chunking, timestamps, custom sampling), load processor/model directly from Transformers.
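For long recordings, the pipeline can chunk internally (via its chunk_length_s and return_timestamps arguments). The chunking idea itself is simple; a minimal sketch with a hypothetical chunk_waveform helper shows why overlapping windows are used:

```python
import numpy as np

def chunk_waveform(audio, sample_rate=16000, chunk_s=30.0, stride_s=5.0):
    """Split a mono waveform into overlapping chunks (hypothetical helper).

    Sketches what long-form ASR pipelines do internally: fixed-size windows
    with `stride_s` seconds of overlap, so a word cut at one chunk boundary
    appears whole in the next chunk. Real pipelines also merge the
    overlapping transcripts afterwards.
    """
    chunk = int(chunk_s * sample_rate)
    stride = int(stride_s * sample_rate)
    step = chunk - stride
    return [audio[start:start + chunk]
            for start in range(0, max(len(audio) - stride, 1), step)]

# 70 seconds of dummy audio -> three overlapping windows
audio = np.zeros(70 * 16000, dtype=np.float32)
chunks = chunk_waveform(audio)
print([len(c) / 16000 for c in chunks])  # [30.0, 30.0, 20.0]
```

In practice you would pass chunk_length_s to the pipeline rather than chunking yourself; the helper only illustrates the windowing that makes long-form inference tractable on a model trained on 30-second inputs.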

Benchmarks

Hugging Face downloads: 2,649,818 (Source: https://huggingface.co/openai/whisper-large-v3-turbo)

Hugging Face likes: 2,765 (Source: https://huggingface.co/openai/whisper-large-v3-turbo)

Parameters: 808.9M (Source: https://huggingface.co/openai/whisper-large-v3-turbo)

Supported languages: 99 (Source: https://huggingface.co/openai/whisper-large-v3-turbo)

Pipeline: automatic-speech-recognition (Source: https://huggingface.co/openai/whisper-large-v3-turbo)

Last Refreshed: 2026-01-09

Key Information

  • Category: Audio Models
  • Type: AI Audio Models Tool