From b4fd0f97a8d9449bc8b4e62fafaf55c6e233169d Mon Sep 17 00:00:00 2001
From: Yoach Lacombe <52246514+ylacombe@users.noreply.github.com>
Date: Fri, 1 Sep 2023 18:36:14 +0200
Subject: [PATCH] Add explanations on encodec and codebooks

---
 chapters/en/chapter6/pre-trained_models.mdx | 46 +++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 43 insertions(+), 3 deletions(-)

diff --git a/chapters/en/chapter6/pre-trained_models.mdx b/chapters/en/chapter6/pre-trained_models.mdx
index 376c4fc7..8e418d18 100644
--- a/chapters/en/chapter6/pre-trained_models.mdx
+++ b/chapters/en/chapter6/pre-trained_models.mdx
@@ -187,7 +187,11 @@ pre-trained checkpoint only supports English language:
 
 Bark is a transformer-based text-to-speech model proposed by Suno AI in [suno-ai/bark](https://github.com/suno-ai/bark).
 
-Bark is made of 4 main models:
+Unlike SpeechT5, Bark generates raw speech waveforms directly, so there is no need for a separate vocoder during inference: it is already built in. This is made possible by [`Encodec`](https://huggingface.co/docs/transformers/main/en/model_doc/encodec), a neural audio codec that compresses audio into a compact sequence of integer codes and decompresses those codes back into a waveform.
+
+With `Encodec`, you can compress audio into a lightweight representation to reduce memory usage, and later decompress it to recover audio that is very close to the original. The compression relies on 8 codebooks, each consisting of integer vectors: think of these codebooks as integer-valued representations or embeddings of the audio. Each successive codebook improves the quality of the reconstruction obtained from the previous ones. Because the codes are sequences of integers, they can be predicted by transformer models, which are very effective at this task. This is exactly what Bark was trained to do.
+
+To be more specific, Bark is made of 4 main models:
 
 - `BarkSemanticModel` (also referred to as the 'text' model): a causal auto-regressive transformer model that takes as input tokenized text, and predicts semantic text tokens that capture the meaning of the text.
 - `BarkCoarseModel` (also referred to as the 'coarse acoustics' model): a causal autoregressive transformer, that takes as input the results of the `BarkSemanticModel` model. It aims at predicting the first two audio codebooks necessary for EnCodec.
@@ -273,9 +277,45 @@ speech_output = model.generate(**inputs).cpu().numpy()
     Your browser does not support the audio element.
 </audio>
 
-Unlike SpeechT5, Bark directly generates raw speech waveforms. This means that you do not need to add a vocoder for inference, it's already "built-in".
 
-In addition, Bark supports batch processing, which means you can process several text entries at the same time, at the expense of more intensive computation.
+In addition to all of these features, Bark supports batch processing: you can process several text entries at the same time, at the cost of more intensive computation.
 On some hardware, such as GPUs, batching enables faster overall generation, which means it can be faster to generate samples all at once than to generate them one by one.
 
 Let's try generating a few examples:
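+
+For instance, a batched call could look like the following sketch (the prompts are invented for illustration, and we assume the `processor` and `model` objects used earlier in this section, plus a `device` variable pointing at your hardware):
+
+```python
+text_prompts = [
+    "Hello, my name is Bark.",
+    "Batching lets us synthesize several prompts in one pass.",
+]
+
+# The processor tokenizes and pads all the prompts in the list at once
+inputs = processor(text_prompts).to(device)
+
+# A single call then generates one waveform per prompt in the batch
+speech_outputs = model.generate(**inputs).cpu().numpy()
+```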
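+
+As a closing aside, if you are curious about the `Encodec` compression and codebooks described earlier, here is a minimal round-trip sketch (the `facebook/encodec_24khz` checkpoint and the silent one-second waveform are assumptions made purely for illustration):
+
+```python
+import numpy as np
+from transformers import AutoProcessor, EncodecModel
+
+encodec_processor = AutoProcessor.from_pretrained("facebook/encodec_24khz")
+encodec = EncodecModel.from_pretrained("facebook/encodec_24khz")
+
+# One second of (silent) audio at Encodec's 24 kHz sampling rate
+raw_audio = np.zeros(24000, dtype=np.float32)
+encodec_inputs = encodec_processor(raw_audio=raw_audio, sampling_rate=24000, return_tensors="pt")
+
+# Compress: at a 6 kbps bandwidth Encodec uses the 8 codebooks
+# mentioned above, so the waveform becomes a small grid of integer codes
+encoded = encodec.encode(encodec_inputs["input_values"], encodec_inputs["padding_mask"], bandwidth=6.0)
+print(encoded.audio_codes.shape)  # (chunks, batch, 8 codebooks, frames)
+
+# Decompress: rebuild a waveform, close to the original, from the codes
+audio_values = encodec.decode(encoded.audio_codes, encoded.audio_scales, encodec_inputs["padding_mask"])[0]
+```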