From ee843c50f3e63fac9b97a081f0174c2139f720c1 Mon Sep 17 00:00:00 2001 From: MKhalusova Date: Tue, 12 Sep 2023 11:27:42 -0400 Subject: [PATCH 1/5] tts pipeline --- chapters/en/_toctree.yml | 2 + chapters/en/chapter2/asr_pipeline.mdx | 2 +- chapters/en/chapter2/introduction.mdx | 2 +- chapters/en/chapter2/tts_pipeline.mdx | 101 ++++++++++++++++++++++++++ 4 files changed, 105 insertions(+), 2 deletions(-) create mode 100644 chapters/en/chapter2/tts_pipeline.mdx diff --git a/chapters/en/_toctree.yml b/chapters/en/_toctree.yml index 7d1500fb..4f9791cb 100644 --- a/chapters/en/_toctree.yml +++ b/chapters/en/_toctree.yml @@ -33,6 +33,8 @@ title: Audio classification with a pipeline - local: chapter2/asr_pipeline title: Automatic speech recognition with a pipeline + - local: chapter2/tts_pipeline + title: Audio generation with a pipeline - local: chapter2/hands_on title: Hands-on exercise diff --git a/chapters/en/chapter2/asr_pipeline.mdx b/chapters/en/chapter2/asr_pipeline.mdx index 2fea9c7e..5fe17eec 100644 --- a/chapters/en/chapter2/asr_pipeline.mdx +++ b/chapters/en/chapter2/asr_pipeline.mdx @@ -1,7 +1,7 @@ # Automatic speech recognition with a pipeline Automatic Speech Recognition (ASR) is a task that involves transcribing speech audio recording into text. -This task has has numerous practical applications, from creating closed captions for videos to enabling voice commands +This task has numerous practical applications, from creating closed captions for videos to enabling voice commands for virtual assistants like Siri and Alexa. In this section, we'll use the `automatic-speech-recognition` pipeline to transcribe an audio recording of a person diff --git a/chapters/en/chapter2/introduction.mdx b/chapters/en/chapter2/introduction.mdx index 849fc525..84619471 100644 --- a/chapters/en/chapter2/introduction.mdx +++ b/chapters/en/chapter2/introduction.mdx @@ -19,6 +19,6 @@ of them having a conversation. or give a voice to an NPC in a game. With 🤗 Transformers, you can easily do that! In this unit, you'll learn how to use pre-trained models for some of these tasks using the `pipeline()` function from 🤗 Transformers. -Specifically, we'll see how the pre-trained models can be used for audio classification and automatic speech recognition. +Specifically, we'll see how the pre-trained models can be used for audio classification, automatic speech recognition and audio generation. Let's get started! diff --git a/chapters/en/chapter2/tts_pipeline.mdx b/chapters/en/chapter2/tts_pipeline.mdx new file mode 100644 index 00000000..aa6811e4 --- /dev/null +++ b/chapters/en/chapter2/tts_pipeline.mdx @@ -0,0 +1,101 @@ +# Audio generation with a pipeline + +Audio generation encompasses a versatile set of tasks that involve producing an audio output. The tasks +that we will look into here are speech generation (aka Text-to-speech task) and music generation. In text-to-speech, a +model transforms a piece of text into lifelike spoken language sound, opening the door to applications such as virtual assistants, +accessibility tools for the visually impaired, and personalized audiobooks. +On the other hand, music generation can enable creative expression, and finds its use mostly in entertainment and game +development industries. + +In 🤗 Transformers, you'll find a pipeline that covers both of these tasks. This pipeline is called `"text-to-audio"`, +but for convenience, it also has a `"text-to-speech"` alias. Here we'll use both, and you are free to pick whichever +seems more descriptive for your task. 
+ +Let's explore how you can use this pipeline to start generating audio narration for texts, and music with just a few lines of code. + +This pipeline is new to 🤗 Transformers, thus you'll need to install the library from the source: + +```bash +pip install git+https://github.com/huggingface/transformers.git +``` + +## Generating speech + +Let's begin by exploring text-to-speech generation. First, just as it was the case with audio classification and automatic +speech recognition, we'll need to define the pipeline. We'll use the `suno/bark-small` model with this pipeline: + +```python +from transformers import pipeline + +pipe = pipeline("text-to-speech", model="suno/bark-small") +``` + +The next step is as simple as passing some text through the pipeline. All the preprocessing will be done for us under the hood: + +```python +text = "Ladybugs have had important roles in culture and religion, being associated with luck, love, fertility and prophecy. " +output = pipe(text) +``` + +In a notebook, we can use the following code snippet to listen to the result: + +```python +from IPython.display import Audio +Audio(output["audio"], rate=output["sampling_rate"]) +``` + +The model that we're using with the pipeline, Bark, is actually multilingual, so we can easily substitute the initial +text with a text in, say, French, and use the pipeline in the exact same way. It will pick up on the language all by itself: + +```python +fr_text = "Contrairement à une idée répandue, le nombre de points sur les élytres d'une coccinelle ne correspond pas à son âge, ni en nombre d'années, ni en nombre de mois. " +output = pipe(fr_text) +Audio(output["audio"], rate=output["sampling_rate"]) +``` + +Not only is this model multilingual, it can also generate audio with non-verbal communications and singing. Here's how +you can make it sing: + +```python +song = "♪ In the jungle, the mighty jungle, the ladybug was seen. ♪ " +output = pipe(song) +Audio(output["audio"], rate=output["sampling_rate"]) +``` + +We'll dive deeper into Bark specifics in the later unit dedicated to Text-to-speech, and will also show how you can use +other models for this task. Now, let's generate some music! + +## Generating music + +Just as before, we'll begin by instantiating a pipeline. For music generation, we'll take the pretrained `facebook/musicgen-small` +checkpoint. + +```python +music_pipe = pipeline("text-to-audio", model="facebook/musicgen-small") +``` + +Let's create a text description of the music we'd like to generate: + +```python +text = "90s rock song with electric guitar and heavy drums" +``` + +For best results, we'll specify some additional music generation parameters to pass to `musicgen`. These are model-, and +not pipeline-specific. + +- `do_sample` introduces some variability and a bit of randomness to improve the "creativeness" of the output +- `max_new_tokens` controls the length of the generated output +- higher `guidance_scale` encourages the model to generate samples more closely linked to the text prompt (at the expense of the audio quality). Guidance scale of 3 is a recommended default. + +```python +forward_params = { + "do_sample": True, + "max_new_tokens": 512, + "guidance_scale": 3 +} + +output = music_pipe(text, forward_params=forward_params) +Audio(output["audio"][0], rate=32000) +``` + +Note: the sampling rate value for music generation comes from the configuration of the model. 
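
A quick aside on this first version of the chapter: the snippets above play audio inline with `IPython.display.Audio`, which only works in a notebook. The same pipeline output can also be written to a WAV file. Below is a minimal sketch, assuming `scipy` and `numpy` are installed; the `bark_out.wav` filename and the `np.squeeze` call are illustration choices, not part of the patch:

```python
import numpy as np
import scipy.io.wavfile
from transformers import pipeline

pipe = pipeline("text-to-speech", model="suno/bark-small")
output = pipe("Ladybugs have had important roles in culture and religion.")

# The pipeline returns a dict holding the waveform and its sampling rate;
# drop any leading batch/channel dimension before writing mono audio.
audio = np.squeeze(output["audio"]).astype(np.float32)
scipy.io.wavfile.write("bark_out.wav", rate=output["sampling_rate"], data=audio)
```

`scipy.io.wavfile.write` stores float32 data as a 32-bit float WAV, so the waveform can stay in the [-1, 1] range the model produces.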
From 34a06c45988e0dabc7dac554c198a75009373675 Mon Sep 17 00:00:00 2001
From: MKhalusova
Date: Tue, 12 Sep 2023 11:49:09 -0400
Subject: [PATCH 2/5] make style

---
 chapters/en/chapter2/tts_pipeline.mdx | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/chapters/en/chapter2/tts_pipeline.mdx b/chapters/en/chapter2/tts_pipeline.mdx
index aa6811e4..00364f82 100644
--- a/chapters/en/chapter2/tts_pipeline.mdx
+++ b/chapters/en/chapter2/tts_pipeline.mdx
@@ -41,6 +41,7 @@ In a notebook, we can use the following code snippet to listen to the result:
 
 ```python
 from IPython.display import Audio
+
 Audio(output["audio"], rate=output["sampling_rate"])
 ```
 
@@ -88,11 +89,7 @@ not pipeline-specific.
 - higher `guidance_scale` encourages the model to generate samples more closely linked to the text prompt (at the expense of the audio quality). Guidance scale of 3 is a recommended default.
 
 ```python
-forward_params = {
-    "do_sample": True,
-    "max_new_tokens": 512,
-    "guidance_scale": 3
-}
+forward_params = {"do_sample": True, "max_new_tokens": 512, "guidance_scale": 3}
 
 output = music_pipe(text, forward_params=forward_params)
 Audio(output["audio"][0], rate=32000)

From 3663371d2485676a53cad4a63c2af14eb9985225 Mon Sep 17 00:00:00 2001
From: Maria Khalusova
Date: Wed, 13 Sep 2023 07:53:32 -0400
Subject: [PATCH 3/5] Apply suggestions from code review

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
---
 chapters/en/chapter2/tts_pipeline.mdx | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/chapters/en/chapter2/tts_pipeline.mdx b/chapters/en/chapter2/tts_pipeline.mdx
index 00364f82..d922f17e 100644
--- a/chapters/en/chapter2/tts_pipeline.mdx
+++ b/chapters/en/chapter2/tts_pipeline.mdx
@@ -1,7 +1,7 @@
 # Audio generation with a pipeline
 
 Audio generation encompasses a versatile set of tasks that involve producing an audio output. The tasks
-that we will look into here are speech generation (aka Text-to-speech task) and music generation. In text-to-speech, a
+that we will look into here are speech generation (aka "text-to-speech") and music generation. In text-to-speech, a
 model transforms a piece of text into lifelike spoken language sound, opening the door to applications such as virtual assistants,
 accessibility tools for the visually impaired, and personalized audiobooks.
 On the other hand, music generation can enable creative expression, and finds its use mostly in entertainment and game
@@ -9,20 +9,20 @@ development industries.
 
 In 🤗 Transformers, you'll find a pipeline that covers both of these tasks. This pipeline is called `"text-to-audio"`,
 but for convenience, it also has a `"text-to-speech"` alias. Here we'll use both, and you are free to pick whichever
-seems more descriptive for your task.
+seems more applicable for your task.
 
 Let's explore how you can use this pipeline to start generating audio narration for texts, and music with just a few lines of code.
 
-This pipeline is new to 🤗 Transformers, thus you'll need to install the library from the source:
+This pipeline is new to 🤗 Transformers and comes as part of the version 4.32 release. Thus you'll need to upgrade the library to the latest version to get the feature:
 
 ```bash
-pip install git+https://github.com/huggingface/transformers.git
+pip install --upgrade transformers
 ```
 
 ## Generating speech
 
 Let's begin by exploring text-to-speech generation. First, just as it was the case with audio classification and automatic
-speech recognition, we'll need to define the pipeline. We'll use the `suno/bark-small` model with this pipeline:
+speech recognition, we'll need to define the pipeline. We'll define a text-to-speech pipeline since it best describes our task, and use the [`suno/bark-small`](https://huggingface.co/suno/bark-small) checkpoint:
 
 ```python
 from transformers import pipeline
 
 pipe = pipeline("text-to-speech", model="suno/bark-small")
 ```
 
@@ -68,8 +68,7 @@ other models for this task. Now, let's generate some music!
 
 ## Generating music
 
-Just as before, we'll begin by instantiating a pipeline. For music generation, we'll take the pretrained `facebook/musicgen-small`
-checkpoint.
+Just as before, we'll begin by instantiating a pipeline. For music generation, we'll define a text-to-audio pipeline, and initialise it with the pretrained checkpoint [`facebook/musicgen-small`](https://huggingface.co/facebook/musicgen-small):
 
 ```python
 music_pipe = pipeline("text-to-audio", model="facebook/musicgen-small")
 ```

From 8602ed5cde00bb7e01ba8fd10b8d812b341f1681 Mon Sep 17 00:00:00 2001
From: MKhalusova
Date: Wed, 13 Sep 2023 09:08:07 -0400
Subject: [PATCH 4/5] simplified the forward_params

---
 chapters/en/chapter2/tts_pipeline.mdx | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/chapters/en/chapter2/tts_pipeline.mdx b/chapters/en/chapter2/tts_pipeline.mdx
index d922f17e..229777da 100644
--- a/chapters/en/chapter2/tts_pipeline.mdx
+++ b/chapters/en/chapter2/tts_pipeline.mdx
@@ -80,15 +80,10 @@ Let's create a text description of the music we'd like to generate:
 text = "90s rock song with electric guitar and heavy drums"
 ```
 
-For best results, we'll specify some additional music generation parameters to pass to `musicgen`. These are model-, and
-not pipeline-specific.
-
-- `do_sample` introduces some variability and a bit of randomness to improve the "creativeness" of the output
-- `max_new_tokens` controls the length of the generated output
-- higher `guidance_scale` encourages the model to generate samples more closely linked to the text prompt (at the expense of the audio quality). Guidance scale of 3 is a recommended default.
+We can control the length of the generated output by passing an additional `max_new_tokens` parameter to the model.
 
 ```python
-forward_params = {"do_sample": True, "max_new_tokens": 512, "guidance_scale": 3}
+forward_params = {"max_new_tokens": 512}
 
 output = music_pipe(text, forward_params=forward_params)
 Audio(output["audio"][0], rate=32000)

From 39d948a5ccedb5f7b023db0206b932f2ee488b8e Mon Sep 17 00:00:00 2001
From: MKhalusova
Date: Thu, 14 Sep 2023 11:59:23 -0400
Subject: [PATCH 5/5] updated musicgen sampling rate

---
 chapters/en/chapter2/tts_pipeline.mdx | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/chapters/en/chapter2/tts_pipeline.mdx b/chapters/en/chapter2/tts_pipeline.mdx
index 229777da..781689a0 100644
--- a/chapters/en/chapter2/tts_pipeline.mdx
+++ b/chapters/en/chapter2/tts_pipeline.mdx
@@ -86,7 +86,5 @@ We can control the length of the generated output by passing an additional `max_
 forward_params = {"max_new_tokens": 512}
 
 output = music_pipe(text, forward_params=forward_params)
-Audio(output["audio"][0], rate=32000)
+Audio(output["audio"][0], rate=output["sampling_rate"])
 ```
-
-Note: the sampling rate value for music generation comes from the configuration of the model.
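
With all five patches applied, the music example reads the sampling rate from the pipeline output instead of hard-coding 32000. For reference, here is the snippet in its final state, as a sketch that also passes the optional `do_sample` and `guidance_scale` parameters documented in patch 1 (patch 4 trimmed them from the prose, but they remain ordinary MusicGen generation arguments that can go through `forward_params`):

```python
from IPython.display import Audio
from transformers import pipeline

music_pipe = pipeline("text-to-audio", model="facebook/musicgen-small")

text = "90s rock song with electric guitar and heavy drums"

# max_new_tokens bounds the length of the generated waveform; do_sample adds
# variability; a guidance_scale of 3 (the recommended default) ties the output
# more closely to the prompt at some expense of audio quality.
forward_params = {"do_sample": True, "max_new_tokens": 512, "guidance_scale": 3}

output = music_pipe(text, forward_params=forward_params)

# The sampling rate comes from the model's configuration via the output dict.
Audio(output["audio"][0], rate=output["sampling_rate"])
```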