Merge pull request #136 from ylacombe/add-bark-tts
Add explanations on encodec and codebooks in TTS chapter
MKhalusova authored Sep 1, 2023
2 parents 177da91 + b4fd0f9 commit 31addbf
Showing 1 changed file with 6 additions and 3 deletions.
chapters/en/chapter6/pre-trained_models.mdx
@@ -187,7 +187,11 @@ pre-trained checkpoint only supports English language:

Bark is a transformer-based text-to-speech model proposed by Suno AI in [suno-ai/bark](https://github.com/suno-ai/bark).

-Bark is made of 4 main models:
+Unlike SpeechT5, Bark generates raw speech waveforms directly, eliminating the need for a separate vocoder during inference – it is already built in. This is made possible by [`Encodec`](https://huggingface.co/docs/transformers/main/en/model_doc/encodec), which serves as both a codec and a compression tool.
+
+With `Encodec`, you can compress audio into a lightweight format to reduce memory usage and subsequently decompress it to restore the original audio. This compression is performed through 8 codebooks, each consisting of integer vectors. Think of these codebooks as representations or embeddings of the audio in integer form. Each successive codebook improves the quality of the reconstruction over the previous ones. Since codebooks are integer vectors, they can be learned by transformer models, which are very efficient at this task. This is what Bark was specifically trained to do.
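To make the codebook idea concrete, here is a minimal residual-quantization sketch in plain NumPy — a toy under assumed sizes, not EnCodec's actual implementation. Each codebook quantizes whatever residual the previous codebooks left behind, so every additional codebook refines the reconstruction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy residual vector quantization (the idea behind EnCodec's codebooks,
# not its real implementation): 8 codebooks, each a table of vectors.
num_codebooks, codebook_size, dim = 8, 64, 4
codebooks = rng.normal(size=(num_codebooks, codebook_size, dim))
codebooks[:, 0] = 0.0  # keep a zero codeword so a stage can leave a residual as-is

frames = rng.normal(size=(32, dim))  # stand-in for encoded audio frames

codes, recon, residual, errors = [], np.zeros_like(frames), frames.copy(), []
for cb in codebooks:
    # pick the nearest codeword to the current residual, per frame
    dists = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
    idx = dists.argmin(axis=1)  # the integer "audio tokens" for this codebook
    codes.append(idx)
    recon += cb[idx]
    residual -= cb[idx]
    errors.append(float((residual**2).sum()))

codes = np.stack(codes)  # shape (8, 32): 8 integers per audio frame
# because of the zero codeword, each successive codebook can only shrink the error
```

The integer arrays in `codes` are exactly the kind of discrete tokens a transformer can learn to predict, which is what Bark's acoustic models do.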

+To be more specific, Bark is made of 4 main models:

- `BarkSemanticModel` (also referred to as the 'text' model): a causal autoregressive transformer that takes tokenized text as input and predicts semantic text tokens capturing the meaning of the text.
- `BarkCoarseModel` (also referred to as the 'coarse acoustics' model): a causal autoregressive transformer that takes the output of `BarkSemanticModel` as input and predicts the first two audio codebooks needed by EnCodec.
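The staged design above can be sketched as a chain of functions. This is a toy illustration with made-up token values, not the real models: each stage consumes the tokens the previous one produced, a fine-acoustics stage fills in the remaining codebooks, and EnCodec's decoder turns the final codebook tokens back into a waveform.

```python
# Toy sketch of Bark's staged pipeline (made-up numbers, not the real models):
# in the real system each stage is an autoregressive transformer.

def semantic_model(text_tokens):
    # BarkSemanticModel: tokenized text -> semantic tokens
    return [t % 100 for t in text_tokens]

def coarse_model(semantic_tokens):
    # BarkCoarseModel: semantic tokens -> first two EnCodec codebooks
    return [[t, t + 1] for t in semantic_tokens]

def fine_model(coarse_codes):
    # fine-acoustics stage: fills in the remaining six codebooks
    return [frame + [frame[-1] + 1] * 6 for frame in coarse_codes]

def encodec_decode(codes):
    # EnCodec's decoder: codebook integers -> waveform samples (here, a fake mix)
    return [sum(frame) / len(frame) for frame in codes]

text_tokens = [101, 205]  # hypothetical token ids
waveform = encodec_decode(fine_model(coarse_model(semantic_model(text_tokens))))
```

The important point is the data flow: text tokens become semantic tokens, which become 8 codebook integers per frame, which decode to audio.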
@@ -273,9 +277,8 @@ speech_output = model.generate(**inputs).cpu().numpy()
Your browser does not support the audio element.
</audio>

-Unlike SpeechT5, Bark directly generates raw speech waveforms. This means that you do not need to add a vocoder for inference, it's already "built-in".
-
-In addition, Bark supports batch processing, which means you can process several text entries at the same time, at the expense of more intensive computation.
+In addition to all these features, Bark supports batch processing, which means you can process several text entries at the same time, at the cost of more intensive computation.
On some hardware, such as GPUs, batching enables faster overall generation: it can be quicker to generate all the samples at once than to generate them one by one.
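Under the hood, batching means padding the tokenized entries to a common length and masking out the padding, so every entry goes through the model in a single forward pass. A minimal sketch of that mechanic, with hypothetical token ids rather than Bark's actual processor output:

```python
import numpy as np

# Hypothetical tokenized text entries of different lengths
batch = [[5, 8, 13], [2, 7]]
pad_id = 0

# Pad every entry to the longest length and record which positions are real
max_len = max(len(seq) for seq in batch)
input_ids = np.array([seq + [pad_id] * (max_len - len(seq)) for seq in batch])
attention_mask = np.array(
    [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
)
# input_ids is now one (2, 3) array that a model can process in one forward pass
```

The mask lets the model ignore the padded positions, which is why batching trades a single larger computation for several smaller ones.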

Let's try generating a few examples:
