
FastSpeech 2 ✈️

A novice's PyTorch implementation of FastSpeech 2: Fast and High-Quality End-to-End Text to Speech, based on Deepest-Project's FastSpeech implementation. The quality of the voice samples generated by this repo is not up to the mark; the main reason is the use of batch_size = 8, forced by limited GPU memory and processing power. With batch_size > 8 my CUDA memory ran out. I would be glad if anyone reading this repo could take up the training with the batch size given in the paper and/or suggest ways of improving the results. 😇

Demo

Download the checkpoint from here, trained on the LJSpeech dataset. Place it in the training_log folder and run inference.ipynb. For mel-to-audio generation I have used MelGAN from 🔦 torch hub.
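As a rough sketch of the mel-to-audio step, assuming the seungwonpark/melgan torch hub entry point (the exact mel shape and device handling depend on the notebook):

import torch

# Load the MelGAN vocoder from torch hub (weights download on first call)
vocoder = torch.hub.load('seungwonpark/melgan', 'melgan')
vocoder.eval()

# In practice, mel is the (1, 80, T) spectrogram predicted by FastSpeech 2;
# a random tensor stands in for it here
mel = torch.randn(1, 80, 234)
with torch.no_grad():
    audio = vocoder.inference(mel)  # 1-D waveform tensor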

Requirements

All code is written in Python 3.6.10.
requirements.txt contains the list of all packages required to run this repo.

pip install -r requirements.txt

For smooth working, download the latest torch build with a suitable CUDA version from here. This repo works with PyTorch >= 1.4. Not sure about lower versions; let me know if they work.
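A quick way to verify the installed version and CUDA availability (a minimal check, not part of the repo):

import torch

print(torch.__version__)          # should be >= 1.4
print(torch.cuda.is_available())  # True if a usable CUDA device is found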

Before moving to the next step, update the hparams.py file as per your requirements.

Pre-processing

The folder MFA_filelist contains pre-extracted alignments obtained with the Montreal Forced Aligner on the LJSpeech dataset. For more information on using MFA, visit here.

python preprocess.py -d /root_path/to/wavs/
python compute_statistics.py

Update the hparams.py file with the appropriate information about pitch and energy (the statistics computed in the previous step).
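For reference, a minimal sketch of how such global statistics can be computed, assuming pitch and energy were saved as per-utterance NumPy arrays (the file layout here is hypothetical; the actual one is defined by preprocess.py):

import glob
import numpy as np

# Hypothetical layout: one .npy file of frame-level values per utterance
pitch = np.concatenate([np.load(f) for f in glob.glob('pitch/*.npy')])
energy = np.concatenate([np.load(f) for f in glob.glob('energy/*.npy')])

# Statistics of the kind typically recorded in hparams.py
print('pitch:  min=%.3f max=%.3f mean=%.3f std=%.3f'
      % (pitch.min(), pitch.max(), pitch.mean(), pitch.std()))
print('energy: min=%.3f max=%.3f mean=%.3f std=%.3f'
      % (energy.min(), energy.max(), energy.mean(), energy.std()))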

Train

Make sure the training_log folder exists in the repo before running the command below.
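If it is missing, create it first (assuming a Unix-like shell):

mkdir -p training_log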

python train.py

Tensorboard images


(Figure: Generated vs. Original Mel spectrograms, as logged to Tensorboard)

Note

  • The output of the present checkpoint is not good because of the lack of training. I will update with a better checkpoint as soon as I can.
  • There are outliers in the dataset that need to be taken care of; removing them should hopefully make the training leaner.
  • Using a lower batch size does not work well with this model.
  • Normalizing pitch and energy may also help with faster training or better convergence (see the sketch after this list).
  • This model was trained on an Nvidia GeForce GTX 960M (4 GB), which is well below the requirements of this model.
  • Feel free to share interesting insights.
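A minimal sketch of the mean-variance normalization meant above, assuming globally computed statistics and per-utterance arrays (all file names and values here are hypothetical):

import numpy as np

def normalize(values, mean, std):
    # Mean-variance normalize frame-level pitch or energy values
    return (values - mean) / std

# Hypothetical example with global statistics from compute_statistics.py
pitch = np.load('pitch/LJ001-0001.npy')
pitch_norm = normalize(pitch, mean=127.5, std=35.2)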

References

  • FastSpeech 2: Fast and High-Quality End-to-End Text to Speech, Ren et al., 2020 (arXiv:2006.04558)
  • Deepest-Project/FastSpeech (the base implementation)
  • MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis, Kumar et al., 2019
  • Montreal Forced Aligner

Happy Learning! 😄
