OpenVoice - AI Audio Models Tool

Overview

OpenVoice is an open-source instant voice-cloning framework developed by researchers at MIT, Tsinghua University and MyShell.ai. It reproduces a reference speaker’s tone color from only a short audio clip and can render arbitrary text in that voice, while providing fine-grained control over style attributes such as emotion, accent, rhythm, pauses and intonation. The project was published as a technical report on arXiv, and the implementation and checkpoints are released under an MIT license; the repository and model card emphasize zero-shot cross-lingual cloning and a modular, production-ready pipeline. ([arxiv.org](https://arxiv.org/abs/2312.01479))

OpenVoice uses a decoupled two-stage approach (a controllable base TTS plus a tone-color converter), so style, language and timbre are modeled separately. OpenVoice V2 (released April 2024) improves audio quality and adds native support for English, Spanish, French, Chinese, Japanese and Korean, while keeping the project MIT-licensed for commercial use. The codebase includes demo notebooks, a local Gradio demo, and example integrations (MyShell.ai demos and a Hugging Face model card). ([github.com](https://github.com/myshell-ai/OpenVoice))

Model Statistics

  • Likes: 488
  • Pipeline: text-to-speech

License: MIT

Model Details

Architecture: OpenVoice separates generation into (1) a base speaker TTS that controls language and high-level style, and (2) a tone-color converter that transfers the reference speaker’s timbre onto the TTS output. The tone-color converter uses an encoder-decoder design with normalizing-flow layers to extract and inject timbre while preserving style attributes. The pipeline relies on a language-agnostic phoneme representation (IPA) to enable robust cross-lingual mapping and zero-shot voice cloning. ([arxiv.org](https://arxiv.org/abs/2312.01479))

Base TTS & tooling: V2 integrates MeloTTS as the base multi-lingual TTS in the public demos; users can swap in or fine-tune different base-speaker TTS models. The repository exposes utilities to extract a speaker embedding (tone color / SE), run the tone-color conversion, and write output files via the provided APIs (e.g. ToneColorConverter, se_extractor, and the MeloTTS TTS wrappers). The demo notebooks show the end-to-end steps: extract an SE from reference audio, synthesize base-speaker audio for the target text, then run converter.convert(...) to produce the final waveform. ([raw.githubusercontent.com](https://raw.githubusercontent.com/myshell-ai/OpenVoice/main/demo_part3.ipynb))

Performance & deployment: OpenVoice is a feed-forward, non-autoregressive pipeline designed for efficient inference; the authors and community reports describe substantial speedups over autoregressive systems. The project provides ready-to-run inference scripts and instructions for local or server deployment, and the code and checkpoints support local or cloud GPU inference. ([arxiv.org](https://arxiv.org/abs/2312.01479))
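Conceptually, the two stages compose as below. This is a minimal, purely illustrative sketch: the stub functions and the dict standing in for audio are inventions for exposition, not the repository's actual API (see Example Usage for the real calls).

```python
# Illustrative sketch of the two-stage split (NOT the repository's API):
# stage 1 renders content + style; stage 2 swaps in the target timbre.

def base_tts(text: str, style: str) -> dict:
    """Stage 1: base-speaker TTS controls language and style, not timbre."""
    return {"content": text, "style": style, "timbre": "base-speaker"}

def tone_color_convert(audio: dict, target_se: str) -> dict:
    """Stage 2: the converter replaces timbre while preserving content and style."""
    converted = dict(audio)
    converted["timbre"] = target_se
    return converted

base = base_tts("Hello world", style="cheerful")
cloned = tone_color_convert(base, target_se="reference_speaker_se")
print(cloned["timbre"], "|", cloned["style"])  # timbre swapped, style preserved
```

Because the stages are decoupled, the same reference embedding can be reused against any base-speaker output, which is what enables zero-shot cross-lingual cloning.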

Key Features

  • Instant cloning from a short reference clip without per-voice retraining.
  • Granular style control: emotion, accent, rhythm, pauses and intonation.
  • Zero-shot cross-lingual cloning using IPA-based language-agnostic content representation.
  • Native multi-lingual V2 support: English, Spanish, French, Chinese, Japanese, Korean.
  • Open-source MIT license with checkpoints and demo notebooks for local deployment.

Example Usage

Example (python):

import torch
from openvoice import se_extractor
from openvoice.api import ToneColorConverter
from melo.api import TTS

# initialize
device = "cuda:0" if torch.cuda.is_available() else "cpu"
converter_ckpt = 'checkpoints_v2/converter'
tone_color_converter = ToneColorConverter(f'{converter_ckpt}/config.json', device=device)
tone_color_converter.load_ckpt(f'{converter_ckpt}/checkpoint.pth')

# extract tone-color embedding from a short reference file
reference_path = 'resources/example_reference.mp3'
target_se, _ = se_extractor.get_se(reference_path, tone_color_converter, vad=True)

# synthesize base-speaker audio (MeloTTS) and convert timbre
tts = TTS(language='EN', device=device)
speaker_ids = tts.hps.data.spk2id  # e.g. {'EN-US': 0, 'EN-BR': 1, ...}
base_out = 'tmp_base.wav'
tts.tts_to_file("Hello world — this is a test.", speaker_ids['EN-US'], output_path=base_out, speed=1.0)

# the converter also needs the base speaker's own embedding (shipped with the V2 checkpoints)
source_se = torch.load('checkpoints_v2/base_speakers/ses/en-us.pth', map_location=device)
final_out = 'final_cloned.wav'

tone_color_converter.convert(audio_src_path=base_out, src_se=source_se, tgt_se=target_se, output_path=final_out, message='@OpenVoice')

print('Saved cloned output to', final_out)

# Notes: adjust paths, speaker IDs, and check the repository demo notebooks for full examples and multi-lingual usage.

Pricing

Free — OpenVoice V1 and V2 are published under the MIT license; code and checkpoints are available for commercial and research use. See the GitHub repository for license details.

Benchmarks

GitHub stars: ~36k (main repository) (Source: https://github.com/myshell-ai/OpenVoice)

Hugging Face model likes: 488 (model card) (Source: https://huggingface.co/myshell-ai/OpenVoice)

ArXiv paper last revised: August 18, 2024 (v6) (Source: https://arxiv.org/abs/2312.01479)

Reported inference speed (optimized): approximately 12× real time on an A10G (~85 ms of compute per second of generated audio), per a community analysis report (Source: https://www.emergentmind.com/articles/2312.01479)

Production usage (MyShell.ai): the paper reports the framework powered the voice-cloning backend for more than 2 million users prior to the public release (Source: https://arxiv.org/abs/2312.01479)
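The "~12× real-time" figure above follows directly from the quoted per-second latency; a quick arithmetic check (the 85 ms number is taken from the community report cited above):

```python
# Reported compute cost: ~85 ms of processing per 1 second of generated audio
ms_per_second_of_audio = 85

# Real-time factor = audio duration (1000 ms) / processing time for that audio
realtime_factor = 1000 / ms_per_second_of_audio
print(f"{realtime_factor:.1f}x real time")  # ~11.8x, i.e. roughly the quoted 12x
```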

Last Refreshed: 2026-03-03

Key Information

  • Category: Audio Models
  • Type: AI Audio Models Tool