WhisperX - AI Audio Tool
Overview
WhisperX is an open-source enhancement layer around OpenAI’s Whisper ASR models that focuses on time-accurate, long-form transcription with word-level timestamps and optional speaker diarization. The project runs Whisper for transcript generation, then applies a phoneme-based forced-alignment stage (typically wav2vec2-based) to produce accurate per-word start/end times, and integrates voice-activity detection (VAD) and batched inference to scale to long audio efficiently.

According to the project's GitHub repository, WhisperX supports batched Whisper inference (reported up to ~60–70x real-time with large-v2 when using faster-whisper), produces SRT-style subtitle outputs with word highlighting, and can call pyannote for speaker diarization (Hugging Face token required). ([github.com](https://github.com/m-bain/whisperX))

Designed for both CLI and Python workflows, WhisperX exposes a one-command CLI (whisperx path/to/audio.wav) and a Python API for load_model → transcribe → align → assign_word_speakers flows. It is intended for research and production use cases (meeting transcription, podcast captioning, searchable archives) where precise timestamps and speaker labels matter. The project is released under the permissive BSD-2-Clause license and is actively maintained, with an associated INTERSPEECH paper and demonstration results (paper and challenge placements are linked from the repo). ([github.com](https://github.com/m-bain/whisperX))
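The transcribe → align flow produces segment records that carry per-word start/end times. A minimal stdlib-only sketch of that output shape and how to flatten it into word records (the field names here are assumptions modeled on the project's JSON output, not calls into the library itself):

```python
# Illustrative sketch of WhisperX-style aligned output. Field names
# ("segments", "words", "start", "end") are assumptions modeled on the
# project's JSON output format; this does not import the real library.
from typing import Any

aligned_result: dict[str, Any] = {
    "segments": [
        {
            "start": 0.0, "end": 1.3, "text": "Hello there everyone",
            "words": [
                {"word": "Hello", "start": 0.00, "end": 0.42},
                {"word": "there", "start": 0.48, "end": 0.80},
                {"word": "everyone", "start": 0.85, "end": 1.30},
            ],
        },
    ],
}

def flatten_words(result: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten segment-level output into one record per word."""
    return [w for seg in result["segments"] for w in seg.get("words", [])]

words = flatten_words(aligned_result)
print(len(words))        # 3
print(words[0]["word"])  # Hello
```

Word-level records like these are what make precise subtitle highlighting and per-word speaker labels possible downstream.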
GitHub Statistics
- Stars: 20,417
- Forks: 2,161
- Contributors: 105
- License: BSD-2-Clause
- Primary Language: Python
- Last Updated: 2026-02-22T16:12:55Z
- Latest Release: v3.8.1
Activity & community: WhisperX is a high-profile open-source project with strong community adoption. The repository shows ~20.4k stars and ~2.2k forks on GitHub (BSD-2-Clause license). The project has several hundred commits and an active issue/PR backlog (issues and PR counts visible on the repo), and the README references research publications and challenge results, demonstrating a link between academic work and open-source engineering. ([github.com](https://github.com/m-bain/whisperX))

Contributor & maintenance notes: repository metadata (provided with this profile) reports ~105 contributors and the most recent commit on 2026-02-22T16:12:55Z, indicating ongoing maintenance and community contributions. The repo’s issues and discussions are the primary support channels; expect active issue threads, occasional breaking changes in development branches, and a stable PyPI release for production installs. (See the repository for current issue/PR counts and recent commits.)
Installation
Install via pip:
pip install whisperx
pip install --upgrade torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121  # install the matching CUDA build; check pytorch.org for the current command
sudo apt-get install ffmpeg  # system dependency for audio I/O on Linux
uvx whisperx  # alternative if using uv/astral tooling; or uvx git+https://github.com/m-bain/whisperX.git for source
Key Features
- Batched Whisper inference to scale long-form audio; reported ~60–70x real-time with large-v2 using faster-whisper. ([github.com](https://github.com/m-bain/whisperX))
- Accurate word-level timestamps via phoneme-based forced alignment (wav2vec2 alignment). ([github.com](https://github.com/m-bain/whisperX))
- Optional speaker diarization integration using pyannote-audio (requires Hugging Face token). ([github.com](https://github.com/m-bain/whisperX))
- VAD-based pre-segmentation to reduce hallucination and enable efficient batching. ([github.com](https://github.com/m-bain/whisperX))
- CLI and Python APIs for transcribe → align → diarize → assign_word_speakers workflows. ([github.com](https://github.com/m-bain/whisperX))
- Flexible compute types (float16/int8) and model-size trade-offs to reduce GPU memory use. ([github.com](https://github.com/m-bain/whisperX))
- Outputs suitable for captioning workflows (SRT with highlighted words) and programmatic JSON segment outputs. ([github.com](https://github.com/m-bain/whisperX))
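The word-highlighted SRT output in the last feature can be sketched from word timings alone. This stdlib-only example renders one cue per word with the current word underlined; the timestamp format follows the SRT convention, while the `<u>…</u>` markup is an assumption about the highlighting style, not the tool's exact output:

```python
# Sketch: render word-highlighted SRT cues from word-level timings.
# One cue per word, with the active word underlined (<u>...</u>); the
# markup choice is an assumption, not WhisperX's verified output format.

def srt_time(t: float) -> str:
    """Format seconds as an SRT timestamp HH:MM:SS,mmm."""
    ms = round(t * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def highlight_srt(words: list[dict]) -> str:
    """Emit one SRT cue per word, underlining the word being spoken."""
    cues = []
    for i, w in enumerate(words):
        line = " ".join(
            f"<u>{x['word']}</u>" if j == i else x["word"]
            for j, x in enumerate(words)
        )
        cues.append(f"{i + 1}\n{srt_time(w['start'])} --> {srt_time(w['end'])}\n{line}")
    return "\n\n".join(cues)

words = [
    {"word": "Hello", "start": 0.0, "end": 0.42},
    {"word": "there", "start": 0.48, "end": 0.80},
]
print(highlight_srt(words))
```

The first cue prints as `1`, then `00:00:00,000 --> 00:00:00,420`, then `<u>Hello</u> there`, which is the pattern karaoke-style caption renderers expect.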
Community
WhisperX has an active open-source community and is widely used for captioning, meeting transcription, and research. GitHub stars and forks indicate broad adoption; issues and PRs show active maintenance and community contributions. Community feedback highlights reliable improvements to Whisper timestamps and fast throughput (praised in tutorials and blogs), while common caveats noted by users include imperfect handling of overlapping speech, language-dependent alignment model requirements, and occasional complexity in setting up diarization (Hugging Face model agreement and PyTorch/CUDA requirements). For troubleshooting and model questions, the repo issues/discussions and linked docs (README + paper) are the primary resources. ([github.com](https://github.com/m-bain/whisperX))
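The overlapping-speech caveat above can be made concrete with a stdlib sketch of word-to-speaker assignment by maximum temporal overlap. This is a simplification for illustration, not WhisperX's actual implementation: a word that straddles two diarization turns still receives a single label, which is exactly why overlapping speech is hard:

```python
# Sketch of word -> speaker assignment by maximum temporal overlap.
# Illustrative simplification, NOT WhisperX's actual implementation:
# each word gets the single speaker whose turn overlaps it most, so
# words spoken during overlapping speech still get only one label.

def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(words: list[dict], turns: list[dict]) -> list[dict]:
    """Label each word with the speaker whose turn overlaps it most."""
    out = []
    for w in words:
        best = max(
            turns,
            key=lambda t: overlap(w["start"], w["end"], t["start"], t["end"]),
            default=None,
        )
        has_overlap = best is not None and overlap(
            w["start"], w["end"], best["start"], best["end"]
        ) > 0
        out.append({**w, "speaker": best["speaker"] if has_overlap else None})
    return out

turns = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 1.0},
    {"speaker": "SPEAKER_01", "start": 1.0, "end": 2.0},
]
words = [
    {"word": "hi", "start": 0.1, "end": 0.4},
    {"word": "yes", "start": 1.2, "end": 1.5},
]
print([w["speaker"] for w in assign_speakers(words, turns)])
# ['SPEAKER_00', 'SPEAKER_01']
```

A word falling in a gap between turns gets `None` here; real pipelines apply additional heuristics for such cases.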
Key Information
- Category: Audio Tools
- Type: AI Audio Tool