This project offers a deeper exploration of tttzof351's "Simple Transformer TTS" codebase, enhanced with insights from Gemini Advanced, Google AI's language model.

raul23/simple-transformer-tts

Demystifying Transformer TTS with Gemini Advanced

This project offers a deeper exploration of tttzof351's "Simple Transformer TTS" tutorial and code, enhanced with insights from Gemini Advanced, Google AI's language model.

The original is a toy implementation of a Transformer TTS with these main simplifications:

  • without tokenizer
  • without scaled pos-encoding
  • without vocoder, only Griffin-Lim
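
To make "without tokenizer" concrete, here is a minimal sketch of character-level encoding: each character maps straight to an integer ID, with no subword or phoneme step. The symbol table and the `encode_text` helper are hypothetical and may not match what `text_to_seq` in st_tts does.

```python
# Hypothetical symbol table: an end-of-sequence marker, some punctuation, and a-z.
SYMBOLS = ["EOS", " ", "!", ",", "-", ".", ";", "?"] + [
    chr(code) for code in range(ord("a"), ord("z") + 1)
]
SYMBOL_TO_ID = {symbol: index for index, symbol in enumerate(SYMBOLS)}


def encode_text(text):
    """Map text to integer IDs character by character, then append EOS."""
    # Characters missing from the table (digits, %, ...) are silently dropped.
    ids = [SYMBOL_TO_ID[ch] for ch in text.lower() if ch in SYMBOL_TO_ID]
    ids.append(SYMBOL_TO_ID["EOS"])
    return ids


print(encode_text("Hi!"))      # [15, 16, 2, 0]
print(encode_text("50% off"))  # [1, 22, 13, 13, 0]
```

With a fixed table like this, anything outside the table never reaches the model, which is one plausible reason a character-level TTS cannot read digits or symbols aloud.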

The model was trained on the LJ Speech Dataset. The LJ Speech Dataset is a public domain speech dataset consisting of 13,100 short audio clips of a single speaker reading passages from 7 non-fiction books.

Note: Also check the Kaggle notebook Simple Transformer Text-to-Speech associated with this GitHub repo, where I replicate the "Simple Transformer TTS" tutorial and code and use Google AI's Gemini for conceptual explanations.

Introduction

Key Features

  • Code Commentary and Docstrings: Gemini has provided extensive comments and docstrings directly within the Python source files of the st_tts package. Explore files like dataset.py, model.py, train.py and others to find in-depth explanations.
  • Module Explanations: Gemini will break down the functionality of TTS system components and their relationships within the codebase, offering a clearer understanding of the system's architecture.
  • Enhanced Learning Experience: Delve into the mechanics of Transformer-based TTS with the support of Gemini's advanced language processing, supplementing the original tutorial.

How It Works

  1. Tutorial and Code Foundation: tttzof351's "Simple Transformer TTS" tutorial and code serve as the foundation for this exploration.
  2. Code Integration: Gemini's insights are seamlessly integrated into the Python files (dataset.py, hyperparams.py, model.py, etc.) within the st_tts package as comments and docstrings.
  3. README Guidance: This README provides an overview and directs users towards the commented code for detailed explanations.

Getting Started

  1. Code Exploration: Dive into the st_tts package and examine the Python files to find Gemini's detailed comments and docstrings.
  2. Tutorial Reference: Refer to the original tutorial for context and the baseline implementation.

Model Architecture Breakdown

Textual depiction of the TransformerTTS model's component interactions:

Text Input -> Preprocessing -> Encoder -> Decoder -> Postprocessing

  • Text Input: This is the initial text you want to convert to speech.

  • Preprocessing:

    • encoder_prenet: Takes the text input, embeds characters or words, and applies linear transformations and convolutions.
    • pos_encoding: Injects positional information into the preprocessed text representation.
  • Encoder:

    • encoder_block_1, encoder_block_2, encoder_block_3: A stack of Encoder blocks that process the preprocessed text representation using self-attention and feed-forward layers.
    • Each encoder block outputs an encoded representation capturing contextual information from the input text.
    • After the final encoder block (encoder_block_3), the encoded representation is normalized using norm_memory.
  • Decoder:

    • decoder_prenet: Takes the mel-spectrogram target (typically from a teacher-forcing approach during training) and transforms it for use by the decoder.
    • decoder_block_1, decoder_block_2, decoder_block_3: A stack of Decoder blocks that generate the predicted mel-spectrogram.
    • Each decoder block uses self-attention over the decoder sequence and cross-attention over the encoder outputs (the encoded text).
    • It also uses feed-forward layers for non-linear transformations.
    • The final decoder block's output is projected using:
      • linear_1: Projects the decoder output to mel-spectrogram features for the predicted spectrogram.
      • linear_2: Projects the decoder output for stop token prediction, indicating when speech has ended.
  • Postprocessing:

    • postnet: Takes the predicted mel-spectrogram from linear_1 and refines it using convolutions for potentially improved quality.
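
As an illustration of the pos_encoding step listed above, here is a minimal sketch of the standard fixed sinusoidal positional encoding from the original Transformer paper. This is an assumption about the general technique, and the repository's own pos_encoding module may be implemented differently. The function name and arguments are hypothetical.

```python
import math


def sinusoidal_position_encoding(num_positions, dim):
    """Return a num_positions x dim table of fixed sinusoidal encodings."""
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim):
            # Indices 2k and 2k+1 share the frequency 1 / 10000^(2k / dim).
            angle = pos / (10000 ** ((i - i % 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


encoding = sinusoidal_position_encoding(num_positions=4, dim=6)
print(encoding[0])  # position 0 alternates sin(0) = 0.0 and cos(0) = 1.0
```

The row for a given position is added to that position's embedding. A "scaled" variant would multiply the table by a learned scalar before the addition, which is the part this toy implementation leaves out.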

Overall data flow: Encoded text representation from the encoder informs the decoder's mel-spectrogram prediction at each step. The decoder's output is then post-processed for potentially better quality.

Note: This is a simplified textual representation, and the actual model might have additional connections or skip-connections not explicitly shown here.
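
The role of the stop token at inference time can be sketched as a loop: the decoder emits one mel frame per step, and generation ends once the predicted stop probability crosses a threshold. The code below is a toy stand-in, not the repository's `inference` method. `generate_frames`, `toy_decoder_step`, and the default values for `max_frames` and `stop_threshold` are all hypothetical.

```python
def generate_frames(decoder_step, max_frames=800, stop_threshold=0.5):
    """Collect frames one at a time until the stop probability exceeds the threshold."""
    frames = []
    for _ in range(max_frames):
        # The decoder sees every frame generated so far (autoregression).
        frame, stop_probability = decoder_step(frames)
        frames.append(frame)
        if stop_probability > stop_threshold:
            break
    return frames


def toy_decoder_step(previous_frames):
    """Stand-in for the real decoder: a dummy frame and a rising stop probability."""
    step = len(previous_frames)
    return [float(step)], min(1.0, 0.2 * (step + 1))


frames = generate_frames(toy_decoder_step)
print(len(frames))  # 3
```

The `max_frames` cap matters in practice: if the stop prediction never fires, the loop would otherwise run forever.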

How to Train the Transformer TTS on Kaggle ⭐

The Kaggle notebook can be found @ kaggle.com/code/raul23/simple-transformer-text-to-speech

Follow these steps to train the Transformer TTS on Kaggle:

  1. Ensure you are using a GPU, preferably T4.

    While the P100 GPU does support mixed precision training, its architecture limitations may result in smaller speed improvements compared to newer NVIDIA GPUs (e.g. T4) with dedicated tensor cores. Tensor cores are specifically designed to accelerate mixed precision computations, which may lead to more pronounced performance gains on newer hardware.

    See GPU T4 vs P100 for more details.

  2. Make sure you have added the necessary inputs for training the model in your notebook:

    • /kaggle/input/ljspeech-meta/metadata.csv: Metadata CSV file containing text, audio filenames, etc., associated with the LJ Speech Dataset
    • /kaggle/input/the-lj-speech-dataset/LJSpeech-1.1/wavs/: Audio WAV files from the LJ Speech Dataset
  3. Preparation and Configuration: Execute all other notebook cells, particularly hyperparams.py, where you can set important hyperparameters such as input and output paths, batch_size, step_print, step_test, and step_save.

  4. Training: To begin training the TTS model, click on 'Run All'. Each cell will be executed from the top of the notebook until the end. You will see the training loop displaying information about the training, such as the epoch, steps, train and test losses.
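
The names step_print, step_test, and step_save suggest step intervals for logging, evaluation, and checkpointing. Assuming that reading is right, the sketch below shows how such intervals typically gate work inside a training loop. The `scheduled_actions` helper and the interval values are illustrative only; check hyperparams.py for the real defaults.

```python
def scheduled_actions(step, step_print, step_test, step_save):
    """Return the periodic actions that are due at a given training step."""
    actions = []
    if step % step_print == 0:
        actions.append("print")  # log the current training loss
    if step % step_test == 0:
        actions.append("test")   # evaluate on held-out data
    if step % step_save == 0:
        actions.append("save")   # write a checkpoint
    return actions


# Illustrative values, not the project's defaults.
intervals = {"step_print": 100, "step_test": 500, "step_save": 1000}

for step in range(1, 1001):
    actions = scheduled_actions(step, **intervals)
    if "save" in actions:
        print(step, actions)  # 1000 ['print', 'test', 'save']
```

Smaller intervals give finer-grained curves and more checkpoints at the cost of extra time per session, which matters under a fixed GPU time limit.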

Results

TensorBoard

I used TensorBoard to track and visualize both train and test losses, as well as display spectrograms and audio data.

Audio from TensorBoard

Useful references about TensorBoard:

How to Install TensorBoard

Here's how I installed TensorBoard within a conda environment:

  1. Install TensorBoard using conda:

    conda install tensorboard
    
  2. If you encounter a ModuleNotFoundError: No module named 'chardet' error when running TensorBoard in your terminal, install chardet:

    pip install chardet
    

Note: TensorFlow does not need to be installed. Without it, however, TensorBoard will warn you that it is running with a reduced feature set.

How to Use TensorBoard

To use TensorBoard:

  1. Open your terminal and run the following command:

    tensorboard --logdir logs/
    

    Here, --logdir points to your directory containing the log files generated while training a model, which includes relevant data such as train/test losses, weights, and audio. In this project, st_tts generates log files such as events.out.tfevents.1714616902.c720ee4d6b8b.34.0 every 1000 steps (default value).

  2. Then, open your browser and go to http://localhost:6006/.

  3. Press CTRL+C in the terminal to quit TensorBoard.

Losses: Train and Test ⭐

Here are the train and test losses after training the TTS model for 68k steps (~12 hours on a T4). By comparison, tttzof351 trained their model for more than 400k steps (~1 day on a V100).

Train loss Test loss

Audio ⭐

The model was evaluated by generating audio samples of the phrase 'Hello, World' after every 1000 steps of training. You can listen here to the audio files generated after training for 1000, 34000, and 68000 steps.

I stopped training after 68000 steps due to reaching the 12-hour session limit for GPU training on Kaggle.

Inference from Pre-trained Transformer TTS ⭐

This section demonstrates how to run inference with a pre-trained Transformer text-to-speech (TTS) model trained on the LJ Speech Dataset. The model was trained by GitHub user tttzof351, and the weights were uploaded to Kaggle for convenience. The inference code comes from tttzof351's GitHub repository and is also available in a Kaggle notebook.

  1. First install the simple-transformer-tts package:
    !pip install git+https://github.com/raul23/simple-transformer-tts#egg=simple-transformer-tts
    
  2. Import the following packages and libraries:
    import IPython.display  # import the submodule explicitly, since IPython.display.Audio is used below
    import torch
    
    from st_tts.hyperparams import hp
    from st_tts.melspecs import inverse_mel_spec_to_wav
    from st_tts.model import TransformerTTS
    from st_tts.text_to_seq import text_to_seq
    from st_tts.write_mp3 import write_mp3
  3. Load the pre-trained Transformer TTS model:
    # Path to the saved model weights file
    train_saved_path = "/kaggle/input/simple-transfer-tts/pytorch/simple-transfer-tts/1/train_SimpleTransfromerTTS.pt"
    
    # Load the saved model weights
    state = torch.load(train_saved_path)
    
    # Initialize the model architecture
    model = TransformerTTS().cuda()
    
    # Load the model weights into the initialized model
    model.load_state_dict(state["model"])
  4. Define the function that generates speech from short texts:
    # Synthesize `text`, save the audio to `name_file`, and return a playable audio widget
    # NOTE: The model is unable to generate audio for numbers or special symbols such as %
    def synthesize_text_to_speech(text="The quick brown fox jumps over the lazy dog", name_file="speech.mp3"):
        # Perform inference to generate mel spectrogram and gate output
        postnet_mel, gate = model.inference(
          text_to_seq(text).unsqueeze(0).cuda(),
          # gate_threshold=1e-5, # TODO: not supported
          with_tqdm = False
        )
    
        # Generate audio from mel spectrogram
        audio = inverse_mel_spec_to_wav(postnet_mel.detach()[0].T)
    
        # Write audio to MP3 file
        write_mp3(
            audio.detach().cpu().numpy(),
            name_file
        )
    
        # Display audio
        return IPython.display.Audio(
            audio.detach().cpu().numpy(),
            rate=hp.sr
        )
  5. Generate the speech based on your text:
    text = '''Breaking news! Scientists have discovered a new exoplanet
    potentially capable of supporting life. Further research is ongoing.'''
    
    synthesize_text_to_speech(text)

You can listen here to the audio files generated based on different types of text (e.g. emotional, factual, poetry).
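
Because the model cannot voice numbers or symbols such as % (see the note in step 4), one workaround is to spell them out before synthesis. The helper below is a hypothetical sketch and is not part of st_tts. It reads digits one at a time and replaces a couple of symbols with words, which is crude but keeps the text within plain letters and punctuation.

```python
import re

# Hypothetical lookup tables, not part of st_tts.
DIGIT_WORDS = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
SYMBOL_WORDS = {"%": " percent", "&": " and "}


def normalize_for_tts(text):
    """Spell out digits one by one and replace a few symbols with words."""
    for symbol, word in SYMBOL_WORDS.items():
        text = text.replace(symbol, word)
    # Replace each digit with its word, padded with spaces.
    text = re.sub(r"[0-9]", lambda match: " " + DIGIT_WORDS[int(match.group())] + " ", text)
    # Collapse repeated whitespace, then remove spaces left before punctuation.
    text = re.sub(r"\s+", " ", text)
    text = re.sub(r"\s+([.,;!?])", r"\1", text)
    return text.strip()


print(normalize_for_tts("Prices rose 25% in 2023."))
# Prices rose two five percent in two zero two three.
```

The result can then be passed to `synthesize_text_to_speech`. Reading "25" as "twenty-five" would need a proper number-to-words routine, which this sketch deliberately leaves out.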

Observations

GPU T4 vs P100

I found that training with the T4 GPU was quicker compared to the P100 GPU:

  • T4: 550 seconds per 1000 steps
  • P100: 750 seconds per 1000 steps

(Note: For each step, one batch of data is processed)
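
To put these timings in perspective, the short calculation below extrapolates them to the 68k-step run reported in the Results section. It assumes the per-1000-step time stays constant and ignores evaluation and checkpointing overhead, so treat the hours as rough lower bounds.

```python
SECONDS_PER_1000_STEPS = {"T4": 550, "P100": 750}  # measurements listed above
TOTAL_STEPS = 68_000  # length of the training run described in Results

for gpu, seconds in SECONDS_PER_1000_STEPS.items():
    hours = TOTAL_STEPS / 1000 * seconds / 3600
    print(f"{gpu}: {hours:.1f} h for {TOTAL_STEPS} steps")
# T4: 10.4 h for 68000 steps
# P100: 14.2 h for 68000 steps

speedup = SECONDS_PER_1000_STEPS["P100"] / SECONDS_PER_1000_STEPS["T4"]
print(f"T4 speedup over P100: {speedup:.2f}x")  # 1.36x
```

Under these assumptions, the same 68k steps on a P100 would not fit within Kaggle's 12-hour GPU session limit.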

It might seem counterintuitive that the T4 would outperform the P100 in training, considering the P100's greater computational power. The reason is that the simple Transformer TTS trains with mixed precision, as the following code snippets show:

  • scaler = torch.cuda.amp.GradScaler() from train.py
  • with torch.autocast(device_type='cuda', dtype=torch.float16) from train.py

While both T4 and P100 support mixed precision, significant performance gains might not be observed on the P100.


ceshine trained a Wide ResNet model on CIFAR-10 and recorded the training times on T4 and P100 GPUs with and without mixed precision:

Training times

In the blog post, ceshine remarked the following:

  1. Training with mixed precision on T4 is almost twice as fast as with single precision, and consumes consistently less GPU memory.
  2. Training wide-resnet with mixed precision on P100 does not have any significant effect in terms of speed.

According to TensorFlow's Guide about Mixed Precision:

While mixed precision will run on most hardware, it will only speed up models on recent NVIDIA GPUs, Cloud TPUs and recent Intel CPUs. [...] The P100 has compute capability 6.0 and is not expected to show a significant speedup.


So in conclusion: while the P100 GPU does support mixed precision training, its architecture limitations may result in smaller speed improvements compared to newer NVIDIA GPUs (e.g. T4) with dedicated tensor cores. Tensor cores are specifically designed to accelerate mixed precision computations, which may lead to more pronounced performance gains on newer hardware.

Contributing

We invite your contributions! To share insights or suggest improvements, please open an issue or submit a pull request.

Acknowledgments

  • Sincere thanks to tttzof351 for creating the original Transformer TTS tutorial and code.
  • Thanks also to the creators of Gemini Advanced at Google AI.
