From ee843c50f3e63fac9b97a081f0174c2139f720c1 Mon Sep 17 00:00:00 2001 From: MKhalusova Date: Tue, 12 Sep 2023 11:27:42 -0400 Subject: [PATCH 1/5] tts pipeline --- chapters/en/_toctree.yml | 2 + chapters/en/chapter2/asr_pipeline.mdx | 2 +- chapters/en/chapter2/introduction.mdx | 2 +- chapters/en/chapter2/tts_pipeline.mdx | 101 ++++++++++++++++++++++++++ 4 files changed, 105 insertions(+), 2 deletions(-) create mode 100644 chapters/en/chapter2/tts_pipeline.mdx diff --git a/chapters/en/_toctree.yml b/chapters/en/_toctree.yml index 7d1500fb..4f9791cb 100644 --- a/chapters/en/_toctree.yml +++ b/chapters/en/_toctree.yml @@ -33,6 +33,8 @@ title: Audio classification with a pipeline - local: chapter2/asr_pipeline title: Automatic speech recognition with a pipeline + - local: chapter2/tts_pipeline + title: Audio generation with a pipeline - local: chapter2/hands_on title: Hands-on exercise diff --git a/chapters/en/chapter2/asr_pipeline.mdx b/chapters/en/chapter2/asr_pipeline.mdx index 2fea9c7e..5fe17eec 100644 --- a/chapters/en/chapter2/asr_pipeline.mdx +++ b/chapters/en/chapter2/asr_pipeline.mdx @@ -1,7 +1,7 @@ # Automatic speech recognition with a pipeline Automatic Speech Recognition (ASR) is a task that involves transcribing speech audio recording into text. -This task has has numerous practical applications, from creating closed captions for videos to enabling voice commands +This task has numerous practical applications, from creating closed captions for videos to enabling voice commands for virtual assistants like Siri and Alexa. In this section, we'll use the `automatic-speech-recognition` pipeline to transcribe an audio recording of a person diff --git a/chapters/en/chapter2/introduction.mdx b/chapters/en/chapter2/introduction.mdx index 849fc525..84619471 100644 --- a/chapters/en/chapter2/introduction.mdx +++ b/chapters/en/chapter2/introduction.mdx @@ -19,6 +19,6 @@ of them having a conversation. or give a voice to an NPC in a game. With 🤗 Transformers, you can easily do that! In this unit, you'll learn how to use pre-trained models for some of these tasks using the `pipeline()` function from 🤗 Transformers. -Specifically, we'll see how the pre-trained models can be used for audio classification and automatic speech recognition. +Specifically, we'll see how the pre-trained models can be used for audio classification, automatic speech recognition and audio generation. Let's get started! diff --git a/chapters/en/chapter2/tts_pipeline.mdx b/chapters/en/chapter2/tts_pipeline.mdx new file mode 100644 index 00000000..aa6811e4 --- /dev/null +++ b/chapters/en/chapter2/tts_pipeline.mdx @@ -0,0 +1,101 @@ +# Audio generation with a pipeline + +Audio generation encompasses a versatile set of tasks that involve producing an audio output. The tasks +that we will look into here are speech generation (aka Text-to-speech task) and music generation. In text-to-speech, a +model transforms a piece of text into lifelike spoken language sound, opening the door to applications such as virtual assistants, +accessibility tools for the visually impaired, and personalized audiobooks. +On the other hand, music generation can enable creative expression, and finds its use mostly in entertainment and game +development industries. + +In 🤗 Transformers, you'll find a pipeline that covers both of these tasks. This pipeline is called `"text-to-audio"`, +but for convenience, it also has a `"text-to-speech"` alias. Here we'll use both, and you are free to pick whichever +seems more descriptive for your task. 
+ +Let's explore how you can use this pipeline to start generating audio narration for texts, and music with just a few lines of code. + +This pipeline is new to 🤗 Transformers, thus you'll need to install the library from the source: + +```bash +pip install git+https://github.com/huggingface/transformers.git +``` + +## Generating speech + +Let's begin by exploring text-to-speech generation. First, just as it was the case with audio classification and automatic +speech recognition, we'll need to define the pipeline. We'll use the `suno/bark-small` model with this pipeline: + +```python +from transformers import pipeline + +pipe = pipeline("text-to-speech", model="suno/bark-small") +``` + +The next step is as simple as passing some text through the pipeline. All the preprocessing will be done for us under the hood: + +```python +text = "Ladybugs have had important roles in culture and religion, being associated with luck, love, fertility and prophecy. " +output = pipe(text) +``` + +In a notebook, we can use the following code snippet to listen to the result: + +```python +from IPython.display import Audio +Audio(output["audio"], rate=output["sampling_rate"]) +``` + +The model that we're using with the pipeline, Bark, is actually multilingual, so we can easily substitute the initial +text with a text in, say, French, and use the pipeline in the exact same way. It will pick up on the language all by itself: + +```python +fr_text = "Contrairement à une idée répandue, le nombre de points sur les élytres d'une coccinelle ne correspond pas à son âge, ni en nombre d'années, ni en nombre de mois. " +output = pipe(fr_text) +Audio(output["audio"], rate=output["sampling_rate"]) +``` + +Not only is this model multilingual, it can also generate audio with non-verbal communications and singing. Here's how +you can make it sing: + +```python +song = "♪ In the jungle, the mighty jungle, the ladybug was seen. ♪ " +output = pipe(song) +Audio(output["audio"], rate=output["sampling_rate"]) +``` + +We'll dive deeper into Bark specifics in the later unit dedicated to Text-to-speech, and will also show how you can use +other models for this task. Now, let's generate some music! + +## Generating music + +Just as before, we'll begin by instantiating a pipeline. For music generation, we'll take the pretrained `facebook/musicgen-small` +checkpoint. + +```python +music_pipe = pipeline("text-to-audio", model="facebook/musicgen-small") +``` + +Let's create a text description of the music we'd like to generate: + +```python +text = "90s rock song with electric guitar and heavy drums" +``` + +For best results, we'll specify some additional music generation parameters to pass to `musicgen`. These are model-, and +not pipeline-specific. + +- `do_sample` introduces some variability and a bit of randomness to improve the "creativeness" of the output +- `max_new_tokens` controls the length of the generated output +- higher `guidance_scale` encourages the model to generate samples more closely linked to the text prompt (at the expense of the audio quality). Guidance scale of 3 is a recommended default. + +```python +forward_params = { + "do_sample": True, + "max_new_tokens": 512, + "guidance_scale": 3 +} + +output = music_pipe(text, forward_params=forward_params) +Audio(output["audio"][0], rate=32000) +``` + +Note: the sampling rate value for music generation comes from the configuration of the model. 
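
A quick aside on this first version of the chapter: the snippets above play audio inline with `IPython.display.Audio`, which only works in a notebook. The same pipeline output can also be written to a WAV file. Below is a minimal sketch, assuming `scipy` and `numpy` are installed; the `bark_out.wav` filename and the `np.squeeze` call are illustration choices, not part of the patch:

```python
import numpy as np
import scipy.io.wavfile
from transformers import pipeline

pipe = pipeline("text-to-speech", model="suno/bark-small")
output = pipe("Ladybugs have had important roles in culture and religion.")

# The pipeline returns a dict holding the waveform and its sampling rate;
# drop any leading batch/channel dimension before writing mono audio.
audio = np.squeeze(output["audio"]).astype(np.float32)
scipy.io.wavfile.write("bark_out.wav", rate=output["sampling_rate"], data=audio)
```

`scipy.io.wavfile.write` stores float32 data as a 32-bit float WAV, so the waveform can stay in the [-1, 1] range the model produces.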
From 34a06c45988e0dabc7dac554c198a75009373675 Mon Sep 17 00:00:00 2001
From: MKhalusova
Date: Tue, 12 Sep 2023 11:49:09 -0400
Subject: [PATCH 2/5] make style

---
 chapters/en/chapter2/tts_pipeline.mdx | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/chapters/en/chapter2/tts_pipeline.mdx b/chapters/en/chapter2/tts_pipeline.mdx
index aa6811e4..00364f82 100644
--- a/chapters/en/chapter2/tts_pipeline.mdx
+++ b/chapters/en/chapter2/tts_pipeline.mdx
@@ -41,6 +41,7 @@ In a notebook, we can use the following code snippet to listen to the result:
 
 ```python
 from IPython.display import Audio
+
 Audio(output["audio"], rate=output["sampling_rate"])
 ```
 
@@ -88,11 +89,7 @@ not pipeline-specific.
 - higher `guidance_scale` encourages the model to generate samples more closely linked to the text prompt (at the expense of the audio quality). Guidance scale of 3 is a recommended default.
 
 ```python
-forward_params = {
-    "do_sample": True,
-    "max_new_tokens": 512,
-    "guidance_scale": 3
-}
+forward_params = {"do_sample": True, "max_new_tokens": 512, "guidance_scale": 3}
 
 output = music_pipe(text, forward_params=forward_params)
 Audio(output["audio"][0], rate=32000)

From 3663371d2485676a53cad4a63c2af14eb9985225 Mon Sep 17 00:00:00 2001
From: Maria Khalusova
Date: Wed, 13 Sep 2023 07:53:32 -0400
Subject: [PATCH 3/5] Apply suggestions from code review

Co-authored-by: Sanchit Gandhi <93869735+sanchit-gandhi@users.noreply.github.com>
---
 chapters/en/chapter2/tts_pipeline.mdx | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/chapters/en/chapter2/tts_pipeline.mdx b/chapters/en/chapter2/tts_pipeline.mdx
index 00364f82..d922f17e 100644
--- a/chapters/en/chapter2/tts_pipeline.mdx
+++ b/chapters/en/chapter2/tts_pipeline.mdx
@@ -1,7 +1,7 @@
 # Audio generation with a pipeline
 
 Audio generation encompasses a versatile set of tasks that involve producing an audio output. The tasks
-that we will look into here are speech generation (aka Text-to-speech task) and music generation. In text-to-speech, a
+that we will look into here are speech generation (aka "text-to-speech") and music generation. In text-to-speech, a
 model transforms a piece of text into lifelike spoken language sound, opening the door to applications such as virtual assistants,
 accessibility tools for the visually impaired, and personalized audiobooks.
 On the other hand, music generation can enable creative expression, and finds its use mostly in entertainment and game
@@ -9,20 +9,20 @@ development industries.
 
 In 🤗 Transformers, you'll find a pipeline that covers both of these tasks. This pipeline is called `"text-to-audio"`,
 but for convenience, it also has a `"text-to-speech"` alias. Here we'll use both, and you are free to pick whichever
-seems more descriptive for your task.
+seems more applicable for your task.
 
 Let's explore how you can use this pipeline to start generating audio narration for texts, and music with just a few lines of code.
 
-This pipeline is new to 🤗 Transformers, thus you'll need to install the library from the source:
+This pipeline is new to 🤗 Transformers and comes as part of the version 4.32 release. Thus you'll need to upgrade the library to the latest version to get the feature:
 
 ```bash
-pip install git+https://github.com/huggingface/transformers.git
+pip install --upgrade transformers
 ```
 
 ## Generating speech
 
 Let's begin by exploring text-to-speech generation. First, just as it was the case with audio classification and automatic
-speech recognition, we'll need to define the pipeline. We'll use the `suno/bark-small` model with this pipeline:
+speech recognition, we'll need to define the pipeline. We'll define a text-to-speech pipeline since it best describes our task, and use the [`suno/bark-small`](https://huggingface.co/suno/bark-small) checkpoint:
 
 ```python
 from transformers import pipeline
 
 pipe = pipeline("text-to-speech", model="suno/bark-small")
 ```
 
@@ -68,8 +68,7 @@ other models for this task. Now, let's generate some music!
 
 ## Generating music
 
-Just as before, we'll begin by instantiating a pipeline. For music generation, we'll take the pretrained `facebook/musicgen-small`
-checkpoint.
+Just as before, we'll begin by instantiating a pipeline. For music generation, we'll define a text-to-audio pipeline, and initialise it with the pretrained checkpoint [`facebook/musicgen-small`](https://huggingface.co/facebook/musicgen-small):
 
 ```python
 music_pipe = pipeline("text-to-audio", model="facebook/musicgen-small")
 ```

From 8602ed5cde00bb7e01ba8fd10b8d812b341f1681 Mon Sep 17 00:00:00 2001
From: MKhalusova
Date: Wed, 13 Sep 2023 09:08:07 -0400
Subject: [PATCH 4/5] simplified the forward_params

---
 chapters/en/chapter2/tts_pipeline.mdx | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/chapters/en/chapter2/tts_pipeline.mdx b/chapters/en/chapter2/tts_pipeline.mdx
index d922f17e..229777da 100644
--- a/chapters/en/chapter2/tts_pipeline.mdx
+++ b/chapters/en/chapter2/tts_pipeline.mdx
@@ -80,15 +80,10 @@ Let's create a text description of the music we'd like to generate:
 text = "90s rock song with electric guitar and heavy drums"
 ```
 
-For best results, we'll specify some additional music generation parameters to pass to `musicgen`. These are model-, and
-not pipeline-specific.
-
-- `do_sample` introduces some variability and a bit of randomness to improve the "creativeness" of the output
-- `max_new_tokens` controls the length of the generated output
-- higher `guidance_scale` encourages the model to generate samples more closely linked to the text prompt (at the expense of the audio quality). Guidance scale of 3 is a recommended default.
+We can control the length of the generated output by passing an additional `max_new_tokens` parameter to the model.
 
 ```python
-forward_params = {"do_sample": True, "max_new_tokens": 512, "guidance_scale": 3}
+forward_params = {"max_new_tokens": 512}
 
 output = music_pipe(text, forward_params=forward_params)
 Audio(output["audio"][0], rate=32000)

From 39d948a5ccedb5f7b023db0206b932f2ee488b8e Mon Sep 17 00:00:00 2001
From: MKhalusova
Date: Thu, 14 Sep 2023 11:59:23 -0400
Subject: [PATCH 5/5] updated musicgen sampling rate

---
 chapters/en/chapter2/tts_pipeline.mdx | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/chapters/en/chapter2/tts_pipeline.mdx b/chapters/en/chapter2/tts_pipeline.mdx
index 229777da..781689a0 100644
--- a/chapters/en/chapter2/tts_pipeline.mdx
+++ b/chapters/en/chapter2/tts_pipeline.mdx
@@ -86,7 +86,5 @@ We can control the length of the generated output by passing an additional `max_
 forward_params = {"max_new_tokens": 512}
 
 output = music_pipe(text, forward_params=forward_params)
-Audio(output["audio"][0], rate=32000)
+Audio(output["audio"][0], rate=output["sampling_rate"])
 ```
-
-Note: the sampling rate value for music generation comes from the configuration of the model.
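
With all five patches applied, the music example reads the sampling rate from the pipeline output instead of hard-coding 32000. For reference, here is the snippet in its final state, as a sketch that also passes the optional `do_sample` and `guidance_scale` parameters documented in patch 1 (patch 4 trimmed them from the prose, but they remain ordinary MusicGen generation arguments that can go through `forward_params`):

```python
from IPython.display import Audio
from transformers import pipeline

music_pipe = pipeline("text-to-audio", model="facebook/musicgen-small")

text = "90s rock song with electric guitar and heavy drums"

# max_new_tokens bounds the length of the generated waveform; do_sample adds
# variability; a guidance_scale of 3 (the recommended default) ties the output
# more closely to the prompt at some expense of audio quality.
forward_params = {"do_sample": True, "max_new_tokens": 512, "guidance_scale": 3}

output = music_pipe(text, forward_params=forward_params)

# The sampling rate comes from the model's configuration via the output dict.
Audio(output["audio"][0], rate=output["sampling_rate"])
```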