kNN-TTS is a simple and lightweight zero-shot Text-to-Speech (TTS) synthesis model. It leverages self-supervised learning (SSL) features and kNN retrieval to achieve robust zero-shot synthesis, matching the performance of state-of-the-art models that are more complex and trained on much larger datasets of transcribed speech. Its low training requirements make kNN-TTS suitable for developing zero-shot multi-speaker models in low-resource settings. It also offers voice morphing with precise control over the output through an interpolation parameter, allowing seamless blending of source and target speech styles.
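Under the hood, voice morphing comes down to linearly interpolating between the SSL features predicted by the TTS model (the source style) and their nearest neighbours retrieved from the target speaker's features. A minimal sketch of the idea, using illustrative variable names and dimensions rather than the library's API:

import numpy as np

# Frame-level SSL features: (num_frames, feat_dim), dimensions illustrative
source_feats = np.random.randn(100, 1024)
knn_target_feats = np.random.randn(100, 1024)

lambda_rate = 0.5  # 0.0 keeps the source style, 1.0 fully applies the target style
morphed_feats = lambda_rate * knn_target_feats + (1.0 - lambda_rate) * source_feats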
Install Poetry:
pip install poetry
Clone the repo and install the dependencies:
git clone git@github.com:idiap/knn-tts.git
cd knn-tts
poetry install
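To check that everything installed correctly, you can try importing the synthesizer used in the example below:

poetry run python -c "from knn_tts.synthesizer import Synthesizer"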
Example:
from huggingface_hub import snapshot_download
from knn_tts.synthesizer import Synthesizer
from knn_tts.utils import get_vocoder_checkpoint_path
CHECKPOINTS_DIR = "checkpoints"
# Download and Get Path to Model Checkpoints
tts_checkpoints_dir = snapshot_download(repo_id="idiap/knn-tts", local_dir=CHECKPOINTS_DIR)
vocoder_checkpoint_path = get_vocoder_checkpoint_path(CHECKPOINTS_DIR)
# Load Synthesizer
tts_checkpoint_name = "best_model_646135.pth"
synthesizer = Synthesizer(tts_checkpoints_dir, tts_checkpoint_name, vocoder_checkpoint_path, model_name="glowtts")
# Synthesis Inputs
target_style_feats_path = "/path/to/extracted/wavlm/feats/"
text_input = "I think foosball is a combination of football and shish kebabs."
lambda_rate = 1.0  # interpolation rate: 1.0 fully applies the target style
# Synthesis Inference
wav = synthesizer(text_input, target_style_feats_path, interpolation_rate=lambda_rate)
# Or, write the output directly to a file
output_path = "output.wav"
synthesizer(text_input, target_style_feats_path, interpolation_rate=lambda_rate, save_path=output_path)
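Setting the interpolation rate to an intermediate value blends the source and target styles rather than fully converting the voice. For example, a hypothetical fifty-fifty morph, reusing the inputs defined above:

wav_morphed = synthesizer(text_input, target_style_feats_path, interpolation_rate=0.5)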
LJSpeech example:
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar -xvjf LJSpeech-1.1.tar.bz2
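Extracting the archive yields an LJSpeech-1.1 directory containing metadata.csv and a wavs folder, which serve as the input to the preprocessing steps below.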
TTS Data Preprocessing Example:
poetry run python scripts/preprocess_tts_data.py 16000 /path/to/dataset /processed/dataset/output/path
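For the LJSpeech data downloaded above, the invocation might look as follows; the output path here is an assumption chosen to line up with the DATASETS_PATH used in the training recipe below:

poetry run python scripts/preprocess_tts_data.py 16000 LJSpeech-1.1 datasets/ljspeech_ssl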
SSL Features Extraction Example:
poetry run python scripts/extract_dataset_embeddings.py wavlm /processed/dataset/output/path /processed/dataset/output/path
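Continuing the LJSpeech example, and assuming the preprocessed dataset path from the previous step, the features can be extracted in place so they end up alongside the wavs:

poetry run python scripts/extract_dataset_embeddings.py wavlm datasets/ljspeech_ssl datasets/ljspeech_ssl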
Update the following paths in the training recipe:
DATASETS_PATH = "knn-tts/datasets/ljspeech_ssl" # Path to preprocessed LJSpeech dataset
OUTPUT_PATH = f"knn-tts/outputs/glow_tts_ssl/{SSL_MODEL}/ljspeech" # Desired output path
The training dataset should be preprocessed with the scripts mentioned above. The expected dataset structure is as follows:
ljspeech_ssl
├── metadata.csv # standard LJSpeech format (filename|text|normalized_text)
├── wavs # contains all audio files
└── wavlm # contains wavlm features extracted from the audio files, maintaining the same filenames
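A quick way to validate this structure is to check that every audio file has a matching feature file. A minimal sketch, comparing filenames without extensions so the feature file format doesn't matter (the dataset path is assumed from the steps above):

from pathlib import Path

dataset = Path("datasets/ljspeech_ssl")
wav_stems = {p.stem for p in (dataset / "wavs").glob("*.wav")}
feat_stems = {p.stem for p in (dataset / "wavlm").iterdir()}
print(f"{len(wav_stems - feat_stems)} audio files are missing extracted features")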
Launch Training:
poetry run python recipes/train_ljspeech_glow_tts_ssl.py
To continue a previous training run:
poetry run python recipes/train_ljspeech_glow_tts_ssl.py --continue_path /path/to/saved/training/run
To monitor a training run:
poetry run tensorboard --logdir=/path/to/training/run
We would like to thank the authors of the following repositories, from which we have adapted the corresponding parts of our codebase:
- HiFiGAN: https://github.com/jik876/hifi-gan
- WavLM: https://github.com/microsoft/unilm/tree/master/wavlm
- kNN-VC: https://github.com/bshall/knn-vc
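If you use this work, please cite the paper: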
@misc{hajal2025knntts,
      title={kNN Retrieval for Simple and Effective Zero-Shot Multi-speaker Text-to-Speech},
      author={Karl El Hajal and Ajinkya Kulkarni and Enno Hermann and Mathew Magimai.-Doss},
      year={2025},
      eprint={2408.10771},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2408.10771},
}