kNN-TTS is a simple and lightweight zero-shot Text-to-Speech (TTS) synthesis model. It leverages self-supervised learning (SSL) features and kNN retrieval to achieve robust zero-shot synthesis, matching the performance of state-of-the-art models that are more complex and trained on much larger datasets of transcribed speech. Its low training requirements make kNN-TTS suitable for developing zero-shot multi-speaker models in low-resource settings. It also offers voice morphing with precise control over the output through an interpolation parameter, allowing seamless blending of source and target speech styles.
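Under the hood, voice morphing comes down to linearly interpolating between the SSL features predicted by the TTS model (the source style) and their nearest neighbours retrieved from the target speaker's features. A minimal sketch of the idea, using illustrative variable names and dimensions rather than the library's API:

import numpy as np

# Frame-level SSL features: (num_frames, feat_dim), dimensions illustrative
source_feats = np.random.randn(100, 1024)
knn_target_feats = np.random.randn(100, 1024)

lambda_rate = 0.5  # 0.0 keeps the source style, 1.0 fully applies the target style
morphed_feats = lambda_rate * knn_target_feats + (1.0 - lambda_rate) * source_feats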
Install Poetry:
pip install poetry
Clone the repo and install the dependencies:
git clone git@github.com:idiap/knn-tts.git
cd knn-tts
poetry install
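To check that everything installed correctly, you can try importing the synthesizer used in the example below:

poetry run python -c "from knn_tts.synthesizer import Synthesizer"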
Example:
from huggingface_hub import snapshot_download
from knn_tts.synthesizer import Synthesizer
from knn_tts.utils import get_vocoder_checkpoint_path
CHECKPOINTS_DIR = "checkpoints"
# Download and Get Path to Model Checkpoints
tts_checkpoints_dir = snapshot_download(repo_id="idiap/knn-tts", local_dir=CHECKPOINTS_DIR)
vocoder_checkpoint_path = get_vocoder_checkpoint_path(CHECKPOINTS_DIR)
# Load Synthesizer
tts_checkpoint_name = "best_model_646135.pth"
synthesizer = Synthesizer(tts_checkpoints_dir, tts_checkpoint_name, vocoder_checkpoint_path, model_name="glowtts")
# Synthesis Inputs
target_style_feats_path = "/path/to/extracted/wavlm/feats/"
text_input = "I think foosball is a combination of football and shish kebabs."
lambda_rate = 1.0  # interpolation rate: 1.0 fully applies the target style
# Synthesis Inference
wav = synthesizer(text_input, target_style_feats_path, interpolation_rate=lambda_rate)
# Or, write the output directly to a file
output_path = "output.wav"
synthesizer(text_input, target_style_feats_path, interpolation_rate=lambda_rate, save_path=output_path)
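Setting the interpolation rate to an intermediate value blends the source and target styles rather than fully converting the voice. For example, a hypothetical fifty-fifty morph, reusing the inputs defined above:

wav_morphed = synthesizer(text_input, target_style_feats_path, interpolation_rate=0.5)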
LJSpeech example:
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar -xvjf LJSpeech-1.1.tar.bz2
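Extracting the archive yields an LJSpeech-1.1 directory containing metadata.csv and a wavs folder, which serve as the input to the preprocessing steps below.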
TTS Data Preprocessing Example:
poetry run python scripts/preprocess_tts_data.py 16000 /path/to/dataset /processed/dataset/output/path
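For the LJSpeech data downloaded above, the invocation might look as follows; the output path here is an assumption chosen to line up with the DATASETS_PATH used in the training recipe below:

poetry run python scripts/preprocess_tts_data.py 16000 LJSpeech-1.1 datasets/ljspeech_ssl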
SSL Features Extraction Example:
poetry run python scripts/extract_dataset_embeddings.py wavlm /processed/dataset/output/path /processed/dataset/output/path
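Continuing the LJSpeech example, and assuming the preprocessed dataset path from the previous step, the features can be extracted in place so they end up alongside the wavs:

poetry run python scripts/extract_dataset_embeddings.py wavlm datasets/ljspeech_ssl datasets/ljspeech_ssl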
Update the following paths in the training recipe:
DATASETS_PATH = "knn-tts/datasets/ljspeech_ssl" # Path to preprocessed LJSpeech dataset
OUTPUT_PATH = f"knn-tts/outputs/glow_tts_ssl/{SSL_MODEL}/ljspeech" # Desired output path
The training dataset should be preprocessed with the scripts mentioned above. The expected dataset structure is as follows:
ljspeech_ssl
├── metadata.csv # standard LJSpeech format (filename|text|normalized_text)
├── wavs # contains all audio files
└── wavlm # contains wavlm features extracted from the audio files, maintaining the same filenames
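A quick way to validate this structure is to check that every audio file has a matching feature file. A minimal sketch, comparing filenames without extensions so the feature file format doesn't matter (the dataset path is assumed from the steps above):

from pathlib import Path

dataset = Path("datasets/ljspeech_ssl")
wav_stems = {p.stem for p in (dataset / "wavs").glob("*.wav")}
feat_stems = {p.stem for p in (dataset / "wavlm").iterdir()}
print(f"{len(wav_stems - feat_stems)} audio files are missing extracted features")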
Launch Training:
poetry run python recipes/train_ljspeech_glow_tts_ssl.py
To continue a previous training run:
poetry run python recipes/train_ljspeech_glow_tts_ssl.py --continue_path /path/to/saved/training/run
To monitor a training run:
poetry run tensorboard --logdir=/path/to/training/run
We would like to thank the authors of the following repositories, from which we have adapted the corresponding parts of our codebase:
- HiFiGAN: https://github.com/jik876/hifi-gan
- WavLM: https://github.com/microsoft/unilm/tree/master/wavlm
- kNN-VC: https://github.com/bshall/knn-vc
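If you use this work, please cite the paper: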
@misc{hajal2025knntts,
      title={kNN Retrieval for Simple and Effective Zero-Shot Multi-speaker Text-to-Speech},
      author={Karl El Hajal and Ajinkya Kulkarni and Enno Hermann and Mathew Magimai.-Doss},
      year={2025},
      eprint={2408.10771},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2408.10771},
}