
HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks

Neural speech synthesis, or text-to-speech (TTS), aims to transform a signal from the text domain to the speech domain. While TTS architectures that train and test on the same set of speakers have seen significant improvements, out-of-domain speaker performance still faces severe limitations. Adaptation to a new set of speakers can be achieved by fine-tuning the whole model for each new domain, which is parameter-inefficient. Adapters provide a parameter-efficient alternative for domain adaptation; however, although popular in NLP, Adapters have so far brought little improvement to speech synthesis. In this work, we present HyperTTS, which comprises a small learnable network, a "hypernetwork", that generates the parameters of the Adapter blocks, allowing us to condition Adapters on speaker representations and make them dynamic. Extensive evaluations in two domain adaptation settings demonstrate its effectiveness in achieving state-of-the-art performance in the parameter-efficient regime. We also compare different variants of HyperTTS against baselines in several studies. Promising results on the dynamic adaptation of adapter parameters using hypernetworks open up new avenues for domain-generic multi-speaker TTS systems.

intro-fig

Figure: Comparison of our approach against baselines: Fine-tuning tunes the backbone model parameters on the adaptation dataset. AdapterTTS inserts learnable modules into the backbone. HyperTTS (ours) converts the static adapter modules into dynamic ones by speaker-conditional sampling using a (learnable) hypernetwork. Both AdapterTTS and HyperTTS keep the backbone model parameters frozen and are thus parameter-efficient.

Architecture

Architecture Figure: An overview of HyperTTS. SE and LE denote speaker embedding and layer embedding.
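To make the mechanism concrete, below is a minimal NumPy sketch of the core idea: a hypernetwork, conditioned on a speaker embedding (SE) and layer embedding (LE), emits the weights of a bottleneck adapter that is applied with a residual connection while the backbone stays frozen. All sizes and names here are illustrative assumptions, not the released configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (assumptions for illustration, not the paper's config).
D_HIDDEN = 16   # backbone hidden size
D_BOTTLE = 4    # adapter bottleneck size
D_COND = 8      # size of concatenated speaker + layer embedding (SE + LE)

# The hypernetwork here is a single linear map from the conditioning
# vector to the flattened adapter weights; the real model is a small MLP.
N_PARAMS = D_HIDDEN * D_BOTTLE + D_BOTTLE * D_HIDDEN
W_hyper = rng.normal(scale=0.01, size=(D_COND, N_PARAMS))

def generate_adapter(cond):
    """Generate down/up projection weights for one adapter block."""
    flat = cond @ W_hyper
    w_down = flat[: D_HIDDEN * D_BOTTLE].reshape(D_HIDDEN, D_BOTTLE)
    w_up = flat[D_HIDDEN * D_BOTTLE :].reshape(D_BOTTLE, D_HIDDEN)
    return w_down, w_up

def adapter_forward(h, cond):
    """Bottleneck adapter with a residual connection; its parameters are
    sampled from the hypernetwork, so they change per speaker/layer."""
    w_down, w_up = generate_adapter(cond)
    z = np.maximum(h @ w_down, 0.0)  # down-project + ReLU
    return h + z @ w_up              # up-project + residual

# Different speakers yield different adapter parameters, hence different
# outputs, from the same frozen backbone hidden states.
h = rng.normal(size=(5, D_HIDDEN))       # 5 frames of hidden states
cond_a = rng.normal(size=D_COND)         # speaker A conditioning vector
cond_b = rng.normal(size=D_COND)         # speaker B conditioning vector
out_a = adapter_forward(h, cond_a)
out_b = adapter_forward(h, cond_b)
```

Only `W_hyper` (and, in practice, the hypernetwork MLP) is trained during adaptation, which is what makes the approach parameter-efficient relative to fine-tuning the backbone.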

We provide a checkpoint here:

Checkpoint pretrained on LTS100: 600000.pth.tar

Pretrain on LTS

CUDA_VISIBLE_DEVICES=0 python3 train.py --dataset LTS

Fine-tune hyperTTS_all on VCTK or LTS2

# LTS2
CUDA_VISIBLE_DEVICES=0 python3 train.py --dataset LTS2 --restore_step 600000
# VCTK
CUDA_VISIBLE_DEVICES=0 python3 train.py --dataset VCTK --restore_step 600000

Inference

CUDA_VISIBLE_DEVICES=2 python3 synthesize.py --source /data/Dataset/preprocessed_data/VCTK_16k/val_unsup.txt --restore_step 900000 --mode batch --dataset VCTK

Get objective metrics

python object_metrics.py --ref_wav_dir /data/result/LTS100_GT --synth_wav_dir /data/result/LTS100_syn/
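object_metrics.py compares reference and synthesized waveforms. One objective metric commonly used for TTS adaptation is mel-cepstral distortion (MCD); the sketch below computes it between two already time-aligned mel-cepstral sequences. This is an assumption-laden illustration, not the repo's script: real pipelines typically extract mel-cepstra from the wav files and align frames with DTW first.

```python
import numpy as np

def mcd(ref_mcep, syn_mcep):
    """Mel-cepstral distortion in dB between two time-aligned mel-cepstral
    sequences of shape (frames, coeffs); the 0th (energy) coefficient is
    excluded, per the standard MCD definition."""
    diff = ref_mcep[:, 1:] - syn_mcep[:, 1:]
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff**2, axis=1))
    return float(np.mean(per_frame))

# Synthetic stand-ins for extracted mel-cepstra (hypothetical data).
rng = np.random.default_rng(0)
ref = rng.normal(size=(100, 13))                      # 100 frames, 13 coeffs
syn = ref + rng.normal(scale=0.05, size=ref.shape)    # slightly perturbed copy
score = mcd(ref, syn)
```

Lower MCD indicates the synthesized speech is spectrally closer to the reference; identical inputs give exactly 0.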

Audio Samples

We compare 20 samples; the generated audio files are uploaded to the directory ./Show20Samples

Our implementation refers to this repo: Comprehensive-Transformer-TTS

Citation

@inproceedings{li2024hypertts,
      title={HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks},
      author={Yingting Li and Rishabh Bhardwaj and Ambuj Mehrish and Bo Cheng and Soujanya Poria},
      booktitle={COLING},
      year={2024},
}
