OpenVoice V2 - AI Audio Models Tool
Overview
OpenVoice V2 is an open-source, research-grade text-to-speech (TTS) and instant voice-cloning system from MyShell.ai and collaborators. Released as the V2 checkpoint in April 2024, it builds on the original OpenVoice research: it decouples "tone color" (timbre) extraction from style and language generation, so a short reference clip can be cloned and used to speak arbitrary text with controlled emotion, rhythm, pauses, and intonation. ([arxiv.org](https://arxiv.org/pdf/2312.01479.pdf))

Compared with V1, V2 focuses on higher audio quality and native multilingual support (English, Spanish, French, Chinese, Japanese, and Korean). The project is distributed under the MIT license (free for research and commercial use) and is provided both as code plus checkpoints on GitHub and as a hosted model card and demo on Hugging Face.

For V2, the recommended pipeline uses MeloTTS as the multilingual base-speaker engine and a separate tone-color converter to produce the final waveform. The Hugging Face model page provides quick-use demos and links to the repository, while the repository itself contains Jupyter demo notebooks and local Gradio examples. ([huggingface.co](https://huggingface.co/myshell-ai/OpenVoiceV2))
Model Statistics
- Likes: 477
- Pipeline: text-to-speech
- License: MIT
Model Details
OpenVoice is a modular, feed-forward TTS framework that separates (1) a base-speaker TTS model, which controls language and style, from (2) a tone-color converter, which applies the reference speaker's timbre. In the public implementation the base TTS is typically a VITS-style model (MeloTTS for V2), the tone-color converter uses a convolutional encoder with an invertible normalizing-flow bottleneck, and HiFi-GAN serves as the vocoder for waveform synthesis. This design enables fast, non-autoregressive inference while preserving controllable style parameters. ([arxiv.org](https://arxiv.org/pdf/2312.01479.pdf))

OpenVoice V2 introduces a modified training strategy to improve audio fidelity over V1 and adds first-class integration with MeloTTS for native multilingual base speakers (English variants, Spanish, French, Chinese, Japanese, Korean). The public package and notebooks expose APIs for extracting speaker embeddings (the SE extractor), generating base audio via MeloTTS, and running the tone-color conversion step. Users can run the pipeline locally (Python 3.9 or later; a CUDA-compatible GPU is recommended for reasonable speed) or try the pre-deployed language demos linked from the model card. The code, checkpoints, and academic technical report are available in the GitHub repository and on arXiv. ([github.com](https://github.com/myshell-ai/OpenVoice/blob/main/README.md))
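The two-stage decomposition described above can be illustrated with a minimal stub pipeline. The function names and return values below are illustrative stand-ins only, not the actual OpenVoice API (the real calls appear under Example Usage):

```python
def extract_tone_color(reference_wav: str) -> list[float]:
    """Stage 1 (stand-in): extract a timbre embedding from a short reference clip."""
    # The real SE extractor returns a learned speaker embedding.
    return [0.1, 0.2, 0.3]

def synthesize_base(text: str, language: str, style: dict) -> str:
    """Stage 2 (stand-in): base-speaker TTS (MeloTTS in V2) controls language and style."""
    return f"base_{language}.wav"  # path to an intermediate waveform

def convert_tone_color(base_wav: str, target_se: list[float]) -> str:
    """Stage 3 (stand-in): the flow-based converter re-timbres the base audio."""
    return "cloned.wav"

# Compose the stages: clone 'reference.wav' speaking English text.
se = extract_tone_color("reference.wav")
out = convert_tone_color(synthesize_base("Hello!", "en", {"speed": 1.0}), se)
print(out)  # → cloned.wav
```

Because the stages only communicate via file paths and an embedding vector, each can be swapped independently, which is how V2 replaced the V1 base speakers with MeloTTS without changing the converter interface.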
Key Features
- Accurate tone-color (timbre) cloning from a short reference clip.
- Flexible style control (emotion, pauses, rhythm, intonation) via base speaker parameters.
- Zero-shot cross-lingual cloning — a voice can be cloned into languages absent from the massive-speaker multilingual (MSML) training set.
- Native multi-lingual support for English, Spanish, French, Chinese, Japanese, Korean.
- Open MIT license — permitted for commercial and research use.
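One informal way to sanity-check timbre cloning is to compare the speaker embeddings of the reference clip and the cloned output. The snippet below uses random vectors as stand-ins for embeddings that would, in practice, come from `se_extractor.get_se`:

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine similarity between two speaker-embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

rng = random.Random(0)
ref_se = [rng.gauss(0, 1) for _ in range(256)]            # reference embedding (stand-in)
cloned_se = [x + 0.05 * rng.gauss(0, 1) for x in ref_se]  # slightly perturbed copy
other_se = [rng.gauss(0, 1) for _ in range(256)]          # unrelated speaker (stand-in)

# A faithful clone should sit much closer to the reference than an unrelated voice.
print(cosine_similarity(ref_se, cloned_se) > cosine_similarity(ref_se, other_se))  # → True
```

This is only a heuristic check; perceptual quality still needs listening tests, but a low embedding similarity is a quick signal that the conversion step failed.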
Example Usage
Example (python):
## Minimal local example (conceptual; adapted from the repo's demo_part3.ipynb)
# Requirements: cloned OpenVoice repo, V2 checkpoints extracted to checkpoints_v2/,
# and MeloTTS installed (one-time):
#   pip install git+https://github.com/myshell-ai/MeloTTS.git
#   python -m unidic download
# See: https://github.com/myshell-ai/OpenVoice and https://huggingface.co/myshell-ai/OpenVoiceV2
import torch
from melo.api import TTS                      # MeloTTS base-speaker engine
from openvoice import se_extractor            # speaker-embedding (SE) extractor
from openvoice.api import ToneColorConverter  # OpenVoice tone-color converter

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

# 1) load the V2 tone-color converter (paths follow the demo notebook layout)
converter = ToneColorConverter('checkpoints_v2/converter/config.json', device=device)
converter.load_ckpt('checkpoints_v2/converter/checkpoint.pth')

# 2) extract the target speaker embedding from a short reference clip
target_se, _ = se_extractor.get_se('reference.wav', converter, vad=False)

# 3) synthesize base speech with a MeloTTS speaker; each base speaker has a
#    precomputed source embedding shipped with the V2 checkpoints
model = TTS(language='EN', device=device)
speaker_ids = model.hps.data.spk2id  # e.g. {'EN-US': 0, 'EN-BR': 1, ...}
text = "Hello — this is a test in English."
model.tts_to_file(text, speaker_ids['EN-US'], 'base.wav')
source_se = torch.load('checkpoints_v2/base_speakers/ses/en-us.pth', map_location=device)

# 4) run tone-color conversion to apply the reference timbre
converter.convert(audio_src_path='base.wav', src_se=source_se,
                  tgt_se=target_se, output_path='cloned_output.wav')
# Note: function names and parameters mirror the public demo notebooks; consult demo_part3.ipynb for full working examples and exact checkpoint names.
Benchmarks
Release date: April 2024 (Source: https://huggingface.co/myshell-ai/OpenVoiceV2)
License: MIT (free for commercial use) (Source: https://github.com/myshell-ai/OpenVoice)
Supported languages (native in V2): English, Spanish, French, Chinese, Japanese, Korean (Source: https://huggingface.co/myshell-ai/OpenVoiceV2)
Architectural summary: Base TTS (VITS/MeloTTS) + Tone-color converter (conv encoder + normalizing flow) + HiFi-GAN vocoder (Source: https://arxiv.org/pdf/2312.01479.pdf)
GitHub stars (OpenVoice repo): ≈ 36k stars (repo front page) (Source: https://github.com/myshell-ai/OpenVoice)
Hugging Face model likes: 477 likes (model card) (Source: https://huggingface.co/myshell-ai/OpenVoiceV2)
Computational efficiency (author claim): Authors report inference/computational cost 'tens of times' lower than some commercial APIs (Source: https://arxiv.org/pdf/2312.01479.pdf)
Key Information
- Category: Audio Models
- Type: AI Audio Models Tool