This repo contains the code for fine-tuning and using the CSM model.
CSM (Conversational Speech Model) is a speech generation model from Sesame that generates RVQ audio codes from text and audio inputs. The model architecture employs a Llama backbone and a smaller audio decoder that produces Mimi audio codes.
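To make the two-stage design above concrete, here is a minimal, self-contained sketch of the idea with toy module and size names (VOCAB, NUM_CODEBOOKS, HIDDEN are placeholders, not the real configuration): a backbone stand-in predicts the first codebook of an audio frame, and a smaller decoder stand-in fills in the remaining codebooks for that frame.

```python
# Conceptual sketch only; toy sizes and stand-in modules, not the real CSM classes.
import torch
import torch.nn as nn

VOCAB, NUM_CODEBOOKS, HIDDEN = 2048, 32, 64      # placeholder sizes

# Stand-ins for the Llama backbone and the smaller audio decoder.
backbone = nn.Sequential(nn.Embedding(VOCAB, HIDDEN), nn.Linear(HIDDEN, VOCAB))
decoder = nn.Sequential(nn.Embedding(VOCAB, HIDDEN), nn.Linear(HIDDEN, VOCAB))

context = torch.tensor([7])                       # stand-in for fused text/audio context
codes = [backbone(context).argmax(-1)]            # codebook 0 comes from the backbone
for _ in range(NUM_CODEBOOKS - 1):                # remaining codebooks come from the decoder
    codes.append(decoder(codes[-1]).argmax(-1))

frame = torch.stack(codes)                        # one frame of RVQ codes: [NUM_CODEBOOKS, 1]
print(frame.shape)
```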
- A CUDA-compatible GPU
- The code has been tested on CUDA 12.4 and 12.6, but it may also work on other versions
- Similarly, Python 3.10 is recommended, but newer versions may be fine
- For some audio operations, `ffmpeg` may be required
- Access to the following Hugging Face models:
  - Llama-3.2-1B
  - CSM-1B (for English generation)
  - yycc/csm-1b-chinese (used in the provided training script)
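Some of these repos are gated, so it can be worth confirming that your Hugging Face account actually has access before setting everything up. Below is a minimal sketch using `huggingface_hub`; the exact repo IDs (for example the `meta-llama/` prefix for Llama-3.2-1B) are assumptions you may need to adjust.

```python
# Hedged sketch: verify that the logged-in Hugging Face account can see the
# required repos. The repo IDs are assumptions; adjust them to your setup.
from huggingface_hub import HfApi

api = HfApi()
for repo_id in ["meta-llama/Llama-3.2-1B", "sesame/csm-1b", "yycc/csm-1b-chinese"]:
    try:
        api.model_info(repo_id)       # raises if the repo is gated or unavailable to you
        print(f"OK: {repo_id}")
    except Exception as err:          # e.g. GatedRepoError, RepositoryNotFoundError
        print(f"No access to {repo_id}: {err}")
```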
```bash
git clone git@github.com:SesameAILabs/csm.git
cd csm
python3.10 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# You will need access to CSM-1B and Llama-3.2-1B
huggingface-cli login
```
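Once the environment is set up, a quick sanity check (a generic sketch, nothing repo-specific) confirms that PyTorch and torchaudio import cleanly and shows which device the generation example below will pick:

```python
# Minimal environment check: confirms torch/torchaudio import and reports
# which backends are available on this machine.
import torch
import torchaudio

print("torch", torch.__version__, "| torchaudio", torchaudio.__version__)
print("CUDA available:", torch.cuda.is_available())
print("MPS available:", torch.backends.mps.is_available())
```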
The `triton` package cannot be installed on Windows. Use `pip install triton-windows` instead.
Generate a sentence
```python
from generator import load_csm_1b
import torchaudio
import torch

if torch.backends.mps.is_available():
    device = "mps"
elif torch.cuda.is_available():
    device = "cuda"
else:
    device = "cpu"

generator = load_csm_1b(device=device)

audio = generator.generate(
    text="你好, 我是sesame,我会说字正腔圆的普通话",  # "Hello, I'm Sesame, and I speak clear, standard Mandarin"
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)

torchaudio.save("audio.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```
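CSM generally sounds better when it is given conversational context. Continuing from the snippet above, the sketch below assumes that `generator.py` exposes a `Segment` dataclass with `speaker`, `text`, and `audio` fields, as in the upstream CSM repo; the audio paths and transcripts are placeholders you would replace with your own prompt clips.

```python
# Hedged sketch: prompting generation with prior conversation turns. Assumes the
# Segment dataclass from the upstream CSM repo; paths/transcripts are placeholders.
from generator import Segment

def load_audio(path):
    # Resample a prompt clip to the generator's sample rate.
    audio, sr = torchaudio.load(path)
    return torchaudio.functional.resample(
        audio.squeeze(0), orig_freq=sr, new_freq=generator.sample_rate
    )

segments = [
    Segment(speaker=0, text="placeholder transcript of utterance_0.wav", audio=load_audio("utterance_0.wav")),
    Segment(speaker=1, text="placeholder transcript of utterance_1.wav", audio=load_audio("utterance_1.wav")),
]

audio = generator.generate(
    text="placeholder reply to speak in speaker 0's voice",
    speaker=0,
    context=segments,
    max_audio_length_ms=10_000,
)
torchaudio.save("audio_with_context.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```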
The `main.py` script provides an example of how to fine-tune the CSM model. It is currently configured to train on a Chinese dataset (`EmiliaIterableDataset`) using the `yycc/csm-1b-chinese` checkpoint as a base.
```bash
# Adjust parameters like NUM_GRAD_ACCUM, LR, and batch_size in main.py as needed
torchrun --nproc_per_node=4 main.py
```
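NUM_GRAD_ACCUM controls how many micro-batches are accumulated before each optimizer step, so a small per-GPU batch_size can emulate a larger effective batch. The snippet below is a generic illustration of that pattern, not the actual training loop in `main.py`; the model, data, and hyperparameter values are placeholders.

```python
# Generic gradient-accumulation pattern (illustration only; not main.py's loop).
# Effective batch size = batch_size * NUM_GRAD_ACCUM * number of GPUs.
import torch

NUM_GRAD_ACCUM, LR = 4, 1e-5                        # placeholder hyperparameters
model = torch.nn.Linear(8, 1)                        # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=LR)

batches = [(torch.randn(2, 8), torch.randn(2, 1)) for _ in range(8)]  # placeholder data
for step, (x, y) in enumerate(batches):
    loss = torch.nn.functional.mse_loss(model(x), y)
    (loss / NUM_GRAD_ACCUM).backward()               # scale so accumulated grads average out
    if (step + 1) % NUM_GRAD_ACCUM == 0:
        optimizer.step()
        optimizer.zero_grad()
```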
Does this model come with any voices?
The model open-sourced here is a base generation model. It is capable of producing a variety of voices, but it has not been fine-tuned on any specific voice.
Can I converse with the model?
CSM is trained to be an audio generation model and not a general-purpose multimodal LLM. It cannot generate text. We suggest using a separate LLM for text generation.
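A common pattern is to let a text LLM produce the reply and then hand that string to CSM. The sketch below assumes `transformers` is installed and uses a placeholder LLM checkpoint; it continues from the generation example above (`generator` and `torchaudio` already loaded).

```python
# Hedged sketch: pair any text LLM with CSM. The LLM checkpoint is a placeholder;
# CSM only needs the final reply string. Assumes `generator` and `torchaudio`
# from the generation example above are already in scope.
from transformers import pipeline

llm = pipeline("text-generation", model="meta-llama/Llama-3.2-1B-Instruct")
prompt = "User: How is the weather today?\nAssistant:"
out = llm(prompt, max_new_tokens=40)[0]["generated_text"]
reply = out[len(prompt):].strip()   # pipeline output includes the prompt; keep only the reply

audio = generator.generate(
    text=reply,
    speaker=0,
    context=[],
    max_audio_length_ms=10_000,
)
torchaudio.save("reply.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```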
Does it support other languages?
The base `sesame/csm-1b` model has some capacity for non-English languages due to data contamination in the training data, but performance may vary.
The provided `main.py` script demonstrates fine-tuning for Chinese using the `yycc/csm-1b-chinese` checkpoint and the `EmiliaIterableDataset`. You can adapt this script for other languages and datasets.
This project provides a high-quality speech generation model for research and educational purposes. While we encourage responsible and ethical use, we explicitly prohibit the following:
- Impersonation or Fraud: Do not use this model to generate speech that mimics real individuals without their explicit consent.
- Misinformation or Deception: Do not use this model to create deceptive or misleading content, such as fake news or fraudulent calls.
- Illegal or Harmful Activities: Do not use this model for any illegal, harmful, or malicious purposes.
By using this model, you agree to comply with all applicable laws and ethical guidelines. We are not responsible for any misuse, and we strongly condemn unethical applications of this technology.
Johan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang, and the Sesame team.