CSM (Conversational Speech Model) - AI Audio Models Tool
Overview
CSM (Conversational Speech Model) is an open-source framework from SesameAILabs for generating conversational speech via discrete audio codes. The model converts text and short audio prompts into RVQ (residual vector quantization) audio codes using a Llama-based language backbone and a smaller, specialized audio decoder that emits Mimi audio codes. Those discrete codes are then decoded to waveforms by the Mimi codec or other codec-aware synthesis pipelines, enabling interactive, multi-turn speech generation that preserves conversational context. CSM is designed for research and for integration into applications such as voice assistants, multimodal chatbots, accessibility tools, and synthetic dialogue systems. Because it outputs compact, quantized audio representations rather than raw waveforms, CSM can be paired with different decoders, codec-aware pipelines, or streaming backends to trade off fidelity, latency, and bandwidth. The project is released under the Apache-2.0 license and emphasizes reproducible experiments, model interoperability, and extension points for fine-tuning language-conditioned audio behavior.
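As a concrete illustration of that pipeline, here is a minimal generation sketch. It assumes the `load_csm_1b` helper and `generate` method shown in the project's README; exact names and signatures should be verified against the source:

```python
# Minimal end-to-end sketch: text in, waveform out. `load_csm_1b` and the
# `generate` call follow the usage shown in the project README; treat the
# exact signatures as assumptions and check the source before relying on them.
import torch
import torchaudio
from generator import load_csm_1b  # module shipped in the csm repo (assumed)

device = "cuda" if torch.cuda.is_available() else "cpu"
generator = load_csm_1b(device=device)

# Internally the model emits Mimi RVQ codes, which are decoded to a
# waveform at generator.sample_rate before being returned.
audio = generator.generate(
    text="Hello from CSM.",
    speaker=0,           # integer speaker id
    context=[],          # no prior conversation turns
    max_audio_length_ms=10_000,
)
torchaudio.save("hello.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```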
GitHub Statistics
- Stars: 14,426
- Forks: 1,462
- Contributors: 11
- License: Apache-2.0
- Primary Language: Python
- Last Updated: 2025-05-27T12:21:51Z
According to the GitHub repository, CSM has substantial community interest, with 14,426 stars and 1,462 forks. The project lists 11 contributors and is actively maintained; the last commit was recorded on 2025-05-27. The Apache-2.0 license supports both commercial and research reuse. The high star and fork counts suggest broad adoption and experimentation, while the relatively small contributor count indicates that development remains centralized in a small core team. Recent commit activity implies ongoing development and responsiveness to issues and feature requests.
Installation
Clone the repository, then install its dependencies and the package itself with pip:

```bash
git clone https://github.com/SesameAILabs/csm.git
cd csm
pip install -r requirements.txt
pip install -e .
```
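The pretrained checkpoints are distributed through Hugging Face. If the weights are gated behind an access agreement (an assumption here, based on how Llama-derived checkpoints are commonly released), authenticate before the first download:

```python
# Log in to Hugging Face so gated checkpoints can be fetched.
# Whether the CSM weights are actually gated is an assumption; skip this
# step if the download succeeds without it.
from huggingface_hub import login

login()  # prompts for a token from https://huggingface.co/settings/tokens
```

Key Features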
- Generates RVQ discrete audio codes from text and audio inputs
- Uses a Llama backbone for language understanding and context
- Specialized audio decoder that produces Mimi audio codes
- Designed for interactive, multi-turn conversational speech synthesis (a context-passing sketch follows this list)
- Apache-2.0 open-source license for research and commercial use
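To make the multi-turn behavior concrete, here is a hedged sketch of conditioning generation on prior conversation turns. The `Segment` container and the `generate` signature mirror the repository's documented example, but treat the exact field and parameter names as assumptions to verify against the code:

```python
# Multi-turn sketch: prior utterances (text plus audio) condition the next
# turn. `Segment` and its fields mirror the README's example; verify the
# exact API against the repository before relying on it.
import torchaudio
from generator import Segment, load_csm_1b  # assumed module layout

generator = load_csm_1b(device="cuda")

def load_turn_audio(path: str):
    # Resample reference audio to the sample rate the model expects.
    waveform, sr = torchaudio.load(path)
    return torchaudio.functional.resample(
        waveform.squeeze(0), orig_freq=sr, new_freq=generator.sample_rate
    )

# Two prior turns establish the speakers and conversational context
# (turn_0.wav / turn_1.wav are placeholder file names).
context = [
    Segment(text="How are you today?", speaker=0, audio=load_turn_audio("turn_0.wav")),
    Segment(text="Doing well, thanks!", speaker=1, audio=load_turn_audio("turn_1.wav")),
]

# Generate speaker 0's next utterance, conditioned on the turns above.
audio = generator.generate(
    text="Glad to hear it. What are you working on?",
    speaker=0,
    context=context,
    max_audio_length_ms=10_000,
)
torchaudio.save("turn_2.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```

Passing earlier turns as context is what lets the model keep speaker identity and delivery consistent across a conversation, rather than synthesizing each utterance in isolation.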
Community
The repository has strong visibility (14.4k stars, 1.46k forks) and active maintenance (last commit 2025-05-27). Community contributions exist (11 contributors) and users can engage via GitHub issues, pull requests, and the project's discussion spaces. Adoption appears strong among researchers and developers exploring discrete-code speech synthesis.
Key Information
- Category: Audio Models
- Type: AI Audio Models Tool