diff --git a/chapters/en/_toctree.yml b/chapters/en/_toctree.yml
index 1c412087..837b1559 100644
--- a/chapters/en/_toctree.yml
+++ b/chapters/en/_toctree.yml
@@ -33,6 +33,8 @@
     title: Audio classification with a pipeline
   - local: chapter2/asr_pipeline
     title: Automatic speech recognition with a pipeline
+  - local: chapter2/tts_pipeline
+    title: Audio generation with a pipeline
   - local: chapter2/hands_on
     title: Hands-on exercise
diff --git a/chapters/en/chapter2/asr_pipeline.mdx b/chapters/en/chapter2/asr_pipeline.mdx
index 2fea9c7e..5fe17eec 100644
--- a/chapters/en/chapter2/asr_pipeline.mdx
+++ b/chapters/en/chapter2/asr_pipeline.mdx
@@ -1,7 +1,7 @@
 # Automatic speech recognition with a pipeline
 
 Automatic Speech Recognition (ASR) is a task that involves transcribing speech audio recording into text.
-This task has has numerous practical applications, from creating closed captions for videos to enabling voice commands
+This task has numerous practical applications, from creating closed captions for videos to enabling voice commands
 for virtual assistants like Siri and Alexa.
 
 In this section, we'll use the `automatic-speech-recognition` pipeline to transcribe an audio recording of a person
diff --git a/chapters/en/chapter2/introduction.mdx b/chapters/en/chapter2/introduction.mdx
index 849fc525..84619471 100644
--- a/chapters/en/chapter2/introduction.mdx
+++ b/chapters/en/chapter2/introduction.mdx
@@ -19,6 +19,6 @@ of them having a conversation.
 
 or give a voice to an NPC in a game. With 🤗 Transformers, you can easily do that! In this unit, you'll learn how to
 use pre-trained models for some of these tasks using the `pipeline()` function from 🤗 Transformers.
-Specifically, we'll see how the pre-trained models can be used for audio classification and automatic speech recognition.
+Specifically, we'll see how the pre-trained models can be used for audio classification, automatic speech recognition and audio generation.
 
 Let's get started!
diff --git a/chapters/en/chapter2/tts_pipeline.mdx b/chapters/en/chapter2/tts_pipeline.mdx
new file mode 100644
index 00000000..781689a0
--- /dev/null
+++ b/chapters/en/chapter2/tts_pipeline.mdx
@@ -0,0 +1,90 @@
+# Audio generation with a pipeline
+
+Audio generation encompasses a versatile set of tasks that involve producing audio output. The tasks
+that we will look into here are speech generation (also known as "text-to-speech") and music generation. In text-to-speech, a
+model transforms a piece of text into lifelike speech, opening the door to applications such as virtual assistants,
+accessibility tools for the visually impaired, and personalized audiobooks.
+On the other hand, music generation enables creative expression, and finds its use mostly in the entertainment and game
+development industries.
+
+In 🤗 Transformers, you'll find a pipeline that covers both of these tasks. This pipeline is called `"text-to-audio"`,
+but for convenience, it also has a `"text-to-speech"` alias. Here we'll use both, and you are free to pick whichever
+seems more applicable to your task.
+
+Let's explore how you can use this pipeline to start generating audio narration for texts, as well as music, with just a few lines of code.
+
+This pipeline is new to 🤗 Transformers and comes as part of the version 4.32 release. Thus, you'll need to upgrade the library to the latest version to get the feature:
+
+```bash
+pip install --upgrade transformers
+```
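+
+If you want to double-check that the upgrade worked, one quick sanity check is to print the installed version; any version at or above 4.32 should include the pipeline:
+
+```python
+import transformers
+
+# The "text-to-audio" pipeline (and its "text-to-speech" alias) was added in
+# 🤗 Transformers 4.32, so the version printed here should be 4.32.0 or later.
+print(transformers.__version__)
+```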
+
+## Generating speech
+
+Let's begin by exploring text-to-speech generation. First, just as was the case with audio classification and automatic
+speech recognition, we'll need to define the pipeline. We'll define a text-to-speech pipeline since it best describes our task, and use the [`suno/bark-small`](https://huggingface.co/suno/bark-small) checkpoint:
+
+```python
+from transformers import pipeline
+
+pipe = pipeline("text-to-speech", model="suno/bark-small")
+```
+
+The next step is as simple as passing some text through the pipeline. All the preprocessing will be done for us under the hood:
+
+```python
+text = "Ladybugs have had important roles in culture and religion, being associated with luck, love, fertility and prophecy. "
+output = pipe(text)
+```
+
+In a notebook, we can use the following code snippet to listen to the result:
+
+```python
+from IPython.display import Audio
+
+Audio(output["audio"], rate=output["sampling_rate"])
+```
+
+The model that we're using with the pipeline, Bark, is actually multilingual, so we can easily replace the initial
+text with text in, say, French, and use the pipeline in the exact same way. It will pick up on the language all by itself:
+
+```python
+fr_text = "Contrairement à une idée répandue, le nombre de points sur les élytres d'une coccinelle ne correspond pas à son âge, ni en nombre d'années, ni en nombre de mois. "
+output = pipe(fr_text)
+Audio(output["audio"], rate=output["sampling_rate"])
+```
+
+Not only is this model multilingual, it can also generate audio containing non-verbal sounds and singing. Here's how
+you can make it sing:
+
+```python
+song = "♪ In the jungle, the mighty jungle, the ladybug was seen. ♪ "
+output = pipe(song)
+Audio(output["audio"], rate=output["sampling_rate"])
+```
+
+We'll dive deeper into the specifics of Bark in a later unit dedicated to text-to-speech, and will also show how you can use
+other models for this task. Now, let's generate some music!
+
+## Generating music
+
+Just as before, we'll begin by instantiating a pipeline. For music generation, we'll define a text-to-audio pipeline, and initialise it with the pretrained checkpoint [`facebook/musicgen-small`](https://huggingface.co/facebook/musicgen-small):
+
+```python
+music_pipe = pipeline("text-to-audio", model="facebook/musicgen-small")
+```
+
+Let's create a text description of the music we'd like to generate:
+
+```python
+text = "90s rock song with electric guitar and heavy drums"
+```
+
+We can control the length of the generated output by passing an additional `max_new_tokens` parameter to the model:
+
+```python
+forward_params = {"max_new_tokens": 512}
+
+output = music_pipe(text, forward_params=forward_params)
+Audio(output["audio"][0], rate=output["sampling_rate"])
+```
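+
+If you're working outside of a notebook, or simply want to keep the result, you can also write the generated waveform to a `.wav` file. Here is a minimal sketch using `scipy`; the filename is just an example, and the `squeeze` assumes a mono checkpoint such as `facebook/musicgen-small`, whose output may carry an extra leading dimension:
+
+```python
+import numpy as np
+import scipy.io.wavfile
+
+# The pipeline returns the waveform as a NumPy array along with its sampling rate.
+# Squeeze away any extra leading dimension so we end up with a 1-D array of samples.
+waveform = np.squeeze(output["audio"]).astype(np.float32)
+
+# Write a standard .wav file that any media player can open.
+scipy.io.wavfile.write("generated_music.wav", rate=output["sampling_rate"], data=waveform)
+```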