WhisperX - AI Audio Tools Tool

Overview

WhisperX is an open-source enhancement layer around OpenAI's Whisper ASR models that focuses on time-accurate, long-form transcription with word-level timestamps and optional speaker diarization. The project runs Whisper for transcript generation, then applies a phoneme-based forced-alignment stage (typically wav2vec2-based) to produce accurate per-word start/end times, and integrates voice-activity detection (VAD) and batched inference to scale to long audio efficiently.

According to the project's GitHub repository, WhisperX supports batched Whisper inference (reported up to ~60–70x real-time with large-v2 when using faster-whisper), produces SRT-style subtitle outputs with word highlighting, and can call pyannote for speaker diarization (Hugging Face token required). ([github.com](https://github.com/m-bain/whisperX))

Designed for both CLI and Python workflows, WhisperX exposes a one-command CLI (whisperx path/to/audio.wav) and a Python API for load_model/transcribe/align/assign_word_speakers flows. It is intended for research and production use cases (meeting transcription, podcast captioning, searchable archives) where precise timestamps and speaker labels matter. The project is released under a permissive BSD-2-Clause license and is actively maintained, with an associated INTERSPEECH paper and demonstration results (paper and challenge placements are linked from the repo). ([github.com](https://github.com/m-bain/whisperX))
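The aligned output described above carries per-segment and per-word start/end times, which makes post-processing into captions straightforward. A minimal sketch: the dict shape below (segments with `start`, `end`, `text`, and an optional `speaker` label) mirrors the JSON-style output shown in the repo's examples, but the exact field names should be treated as assumptions, not a stable API.

```python
# Sketch: turn WhisperX-style aligned segments into SRT caption blocks.
# The dict shape (start/end/text/speaker per segment) mirrors the repo's
# JSON output; exact field names are assumptions, not a stable API.

def fmt_ts(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments: list[dict]) -> str:
    """Emit one SRT cue per segment, prefixing the speaker label if present."""
    cues = []
    for i, seg in enumerate(segments, start=1):
        text = seg["text"].strip()
        if seg.get("speaker"):
            text = f"[{seg['speaker']}] {text}"
        cues.append(f"{i}\n{fmt_ts(seg['start'])} --> {fmt_ts(seg['end'])}\n{text}\n")
    return "\n".join(cues)

if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 2.5, "text": " Hello there.", "speaker": "SPEAKER_00"},
        {"start": 2.5, "end": 4.0, "text": " Hi!", "speaker": "SPEAKER_01"},
    ]
    print(segments_to_srt(demo))
```

In practice one would read the segments from WhisperX's output rather than a hand-built list; the timestamp helper is the only SRT-specific piece.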

GitHub Statistics

  • Stars: 20,417
  • Forks: 2,161
  • Contributors: 105
  • License: BSD-2-Clause
  • Primary Language: Python
  • Last Updated: 2026-02-22T16:12:55Z
  • Latest Release: v3.8.1

Activity & community: WhisperX is a high-profile open-source project with strong community adoption. The repository shows ~20.4k stars and ~2.2k forks on GitHub (BSD-2-Clause license). The project has several hundred commits and an active issue/PR backlog, and the README references research publications and challenge results, demonstrating a link between academic work and open-source engineering. ([github.com](https://github.com/m-bain/whisperX))

Contributor & maintenance notes: repository metadata (provided with this profile) reports ~105 contributors, with the most recent commit on 2026-02-22T16:12:55Z, indicating ongoing maintenance and community contributions. The repo's issues and discussions are the primary support channels; expect active issue threads, occasional breaking changes in development branches, and a stable PyPI release for production installs. (See the repository for current issue/PR counts and recent commits.)

Installation

Install via pip:

pip install whisperx
pip install --upgrade torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121  # (install matching CUDA version; check pytorch.org for current command)
sudo apt-get install ffmpeg  # system dependency for audio I/O on Linux
uvx whisperx  # run via uv's uvx without a persistent install; or uvx --from git+https://github.com/m-bain/whisperX.git whisperx to run from source
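ffmpeg must be discoverable on PATH for audio decoding; a quick stdlib pre-flight check can catch a missing install before the first transcription run. This helper is purely illustrative and is not part of WhisperX itself.

```python
# Sketch: verify the ffmpeg system dependency is on PATH before running
# transcription. Illustrative helper only; not part of WhisperX.
import shutil

def has_ffmpeg() -> bool:
    """Return True if an ffmpeg executable is found on PATH."""
    return shutil.which("ffmpeg") is not None

if __name__ == "__main__":
    print("ffmpeg found" if has_ffmpeg() else "ffmpeg missing: install it first")
```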

Key Features

  • Batched Whisper inference to scale long-form audio; reported ~60–70x real-time with large-v2 using faster-whisper. ([github.com](https://github.com/m-bain/whisperX))
  • Accurate word-level timestamps via phoneme-based forced alignment (wav2vec2 alignment). ([github.com](https://github.com/m-bain/whisperX))
  • Optional speaker diarization integration using pyannote-audio (requires Hugging Face token). ([github.com](https://github.com/m-bain/whisperX))
  • VAD-based pre-segmentation to reduce hallucination and enable efficient batching. ([github.com](https://github.com/m-bain/whisperX))
  • CLI and Python APIs for transcribe → align → diarize → assign_word_speakers workflows. ([github.com](https://github.com/m-bain/whisperX))
  • Flexible compute types (float16/int8) and model-size trade-offs to reduce GPU memory use. ([github.com](https://github.com/m-bain/whisperX))
  • Outputs suitable for captioning workflows (SRT with highlighted words) and programmatic JSON segment outputs. ([github.com](https://github.com/m-bain/whisperX))

Community

WhisperX has an active open-source community and is widely used for captioning, meeting transcription, and research. GitHub stars and forks indicate broad adoption, and the issue/PR activity shows ongoing maintenance and community contributions. Community feedback highlights reliable improvements to Whisper's timestamps and fast throughput (praised in tutorials and blogs); common caveats noted by users include imperfect handling of overlapping speech, language-dependent alignment-model requirements, and occasional complexity in setting up diarization (accepting the Hugging Face model agreement and matching PyTorch/CUDA requirements). For troubleshooting and model questions, the repo's issues/discussions and linked docs (README + paper) are the primary resources. ([github.com](https://github.com/m-bain/whisperX))

Last Refreshed: 2026-03-03

Key Information

  • Category: Audio Tools
  • Type: AI Audio Tools Tool