Merge pull request #70 from sanchit-gandhi/u5-public
U5 - ASR
MKhalusova authored Jun 27, 2023
2 parents a77d4ea + 7b556e3 commit 0245987
Showing 19 changed files with 1,654 additions and 33 deletions.
39 changes: 18 additions & 21 deletions chapters/en/_toctree.yml
@@ -65,29 +65,26 @@
- local: chapter4/hands_on
title: Hands-on exercise

#- title: Unit 5. Transcribe a meeting recording
# sections:
# - local: chapter5/introduction
# title: What you'll learn and what you'll build
# - local: chapter5/choosing_dataset
# title: Choosing a dataset
# - local: chapter5/asr_models
# title: Pre-trained models for automatic speech recognition
# - local: chapter5/preprocessing_data
# title: Loading and preprocessing data
# - local: chapter5/evaluation
# title: Evaluation metrics for ASR
# - local: chapter5/fine-tuning
# title: Fine-tuning the ASR model
- title: Unit 5. Automatic Speech Recognition
sections:
- local: chapter5/introduction
title: What you'll learn and what you'll build
- local: chapter5/asr_models
title: Pre-trained models for speech recognition
- local: chapter5/choosing_dataset
title: Choosing a dataset
- local: chapter5/evaluation
title: Evaluation and metrics for speech recognition
- local: chapter5/fine-tuning
title: How to fine-tune an ASR system with the Trainer API
# - local: chapter5/speaker_diarization
# title: Automatic speech recognition with speaker diarization
# - local: chapter5/quiz
# title: Quiz
# quiz: 5
# - local: chapter5/hands_on
# title: Hands-on exercise
# - local: chapter5/supplemental_reading
# title: Supplemental reading and resources
- local: chapter5/demo
title: Building a demo
- local: chapter5/hands_on
title: Hands-on exercise
- local: chapter5/supplemental_reading
title: Supplemental reading and resources
#
#- title: Unit 6. From text to speech
# sections:
396 changes: 396 additions & 0 deletions chapters/en/chapter5/asr_models.mdx

Large diffs are not rendered by default.

126 changes: 126 additions & 0 deletions chapters/en/chapter5/choosing_dataset.mdx
@@ -0,0 +1,126 @@
# Choosing a dataset

As with any machine learning problem, our model is only as good as the data that we train it on. Speech recognition
datasets vary considerably in how they are curated and the domains that they cover. To pick the right dataset, we need
to match our criteria with the features that a dataset offers.

Before we pick a dataset, we first need to understand the key defining features.

## Features of speech datasets

### 1. Number of hours
Simply put, the number of training hours indicates how large the dataset is. It’s analogous to the number of training
examples in an NLP dataset. However, bigger datasets aren’t necessarily better. If we want a model that generalises well,
we want a **diverse** dataset with lots of different speakers, domains and speaking styles.

### 2. Domain
The domain describes where the data was sourced from, whether it be audiobooks, podcasts, YouTube or financial meetings.
Each domain has a different distribution of data. For example, audiobooks are recorded in high-quality studio conditions
(with no background noise), with the text taken from written literature. For YouTube, on the other hand, the audio likely contains
more background noise and a more informal style of speech.

We need to match our domain to the conditions we anticipate at inference time. For instance, if we train our model on
audiobooks, we can’t expect it to perform well in noisy environments.

### 3. Speaking style
The speaking style falls into one of two categories:

* Narrated: read from a script
* Spontaneous: un-scripted, conversational speech

The audio and text data reflect the style of speaking. Since narrated text is scripted, it tends to be spoken articulately
and without any errors:

```
“Consider the task of training a model on a speech recognition dataset”
```

For spontaneous speech, on the other hand, we can expect a more colloquial style, with the inclusion of repetitions,
hesitations and false starts:

```
“Let’s uhh let's take a look at how you'd go about training a model on uhm a sp- speech recognition dataset”
```

### 4. Transcription style
The transcription style refers to whether the target text has punctuation, casing or both. If we want a system to generate
fully formatted text that could be used for a publication or meeting transcription, we require training data with punctuation
and casing. If we just require the spoken words in an un-formatted structure, neither punctuation nor casing is necessary.
In this case, we can either pick a dataset without punctuation or casing, or pick one that has punctuation and casing and
then subsequently remove them from the target text through pre-processing.
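
As a rough sketch of what that pre-processing could look like (the helper name and regular expression below are purely illustrative, not part of the course code):

```python
import re


def normalise_transcription(text: str) -> str:
    """Lower-case the target text and strip punctuation, keeping only the spoken words."""
    text = text.lower()
    # keep word characters, whitespace and apostrophes; drop everything else
    text = re.sub(r"[^\w\s']", "", text)
    return " ".join(text.split())


print(normalise_transcription("Consider the task of training a model!"))
# consider the task of training a model
```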

## A summary of datasets on the Hub

Here is a summary of the most popular English speech recognition datasets on the Hugging Face Hub:

| Dataset | Train Hours | Domain | Speaking Style | Casing | Punctuation | License | Recommended Use |
|-----------------------------------------------------------------------------------------|-------------|-----------------------------|-----------------------|--------|-------------|-----------------|----------------------------------|
| [LibriSpeech](https://huggingface.co/datasets/librispeech_asr) | 960 | Audiobook | Narrated ||| CC-BY-4.0 | Academic benchmarks |
| [Common Voice 11](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) | 3000 | Wikipedia | Narrated ||| CC0-1.0 | Non-native speakers |
| [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) | 540 | European Parliament | Oratory ||| CC0 | Non-native speakers |
| [TED-LIUM](https://huggingface.co/datasets/LIUM/tedlium) | 450 | TED talks | Oratory ||| CC-BY-NC-ND 3.0 | Technical topics |
| [GigaSpeech](https://huggingface.co/datasets/speechcolab/gigaspeech) | 10000 | Audiobook, podcast, YouTube | Narrated, spontaneous ||| apache-2.0 | Robustness over multiple domains |
| [SPGISpeech](https://huggingface.co/datasets/kensho/spgispeech) | 5000 | Financial meetings | Oratory, spontaneous ||| User Agreement | Fully formatted transcriptions |
| [Earnings-22](https://huggingface.co/datasets/revdotcom/earnings22) | 119 | Financial meetings | Oratory, spontaneous ||| CC-BY-SA-4.0 | Diversity of accents |
| [AMI](https://huggingface.co/datasets/edinburghcstr/ami) | 100 | Meetings | Spontaneous ||| CC-BY-4.0 | Noisy speech conditions |

This table serves as a reference for selecting a dataset based on your criteria. Below is an equivalent table for
multilingual speech recognition. Note that we omit the train hours column, since this varies depending on the language
for each dataset, and replace it with the number of languages per dataset:

| Dataset | Languages | Domain | Speaking Style | Casing | Punctuation | License | Recommended Usage |
|-----------------------------------------------------------------------------------------------|-----------|---------------------------------------|----------------|--------|-------------|-----------|-------------------------|
| [Multilingual LibriSpeech](https://huggingface.co/datasets/facebook/multilingual_librispeech) | 6 | Audiobooks | Narrated ||| CC-BY-4.0 | Academic benchmarks |
| [Common Voice 13](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0) | 108 | Wikipedia text & crowd-sourced speech | Narrated ||| CC0-1.0 | Diverse speaker set |
| [VoxPopuli](https://huggingface.co/datasets/facebook/voxpopuli) | 15 | European Parliament recordings | Spontaneous ||| CC0 | European languages |
| [FLEURS](https://huggingface.co/datasets/google/fleurs)                                        | 101       | Wikipedia sentences read aloud        | Narrated       ||| CC-BY-4.0 | Multilingual evaluation |

For a detailed breakdown of the audio datasets covered in both tables, refer to the blog post [A Complete Guide to Audio Datasets](https://huggingface.co/blog/audio-datasets#a-tour-of-audio-datasets-on-the-hub).
While there are over 180 speech recognition datasets on the Hub, it's possible that none of them matches
your needs. In this case, it's also possible to use your own audio data with 🤗 Datasets. To create a custom audio dataset,
refer to the guide [Create an audio dataset](https://huggingface.co/docs/datasets/audio_dataset). When creating a custom
audio dataset, consider sharing the final dataset on the Hub so that others in the community can benefit from your
efforts - the audio community is inclusive and wide-ranging, and others will appreciate your work as you do theirs.
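
If your audio files and transcriptions already live in a local folder, a minimal way of turning them into a 🤗 Datasets dataset is the `audiofolder` loading script described in that guide. A quick sketch, assuming a hypothetical folder layout and repository name:

```python
from datasets import load_dataset

# Assumed layout:
# my_dataset/
# ├── metadata.csv   (columns: file_name, transcription)
# ├── audio_1.wav
# └── audio_2.wav
dataset = load_dataset("audiofolder", data_dir="my_dataset")

# optionally share the dataset on the Hub so the community can benefit from it too
dataset.push_to_hub("your-username/my-asr-dataset")
```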

Alright! Now that we've gone through all the criteria for selecting an ASR dataset, let's pick one for the purpose of this tutorial.
We know that Whisper already does a pretty good job at transcribing data in high-resource languages (such as English and Spanish), so
we'll focus on low-resource multilingual transcription. We want to retain Whisper's ability to predict punctuation and casing,
so it seems from the second table that Common Voice 13 is a great candidate dataset!

## Common Voice 13

Common Voice 13 is a crowd-sourced dataset where speakers record text from Wikipedia in various languages. It forms part of
the Common Voice series, a collection of datasets released by the Mozilla Foundation. At the time of writing,
Common Voice 13 is the latest edition of the dataset, with the most languages and hours per language out of any release to date.

We can get the full list of languages for the Common Voice 13 dataset by checking out the dataset page on the Hub:
[mozilla-foundation/common_voice_13_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_13_0).
The first time you view this page, you'll be asked to accept the terms of use. After that, you'll be given full access to the dataset.
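
Note that loading the dataset programmatically (as we'll do in the fine-tuning section) also requires you to be logged in with a token for the account that accepted the terms of use. From a notebook, this looks like the following (from a terminal, `huggingface-cli login` achieves the same thing):

```python
from huggingface_hub import notebook_login

# prompts for a Hugging Face access token and stores it locally
notebook_login()
```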

Once we've provided authentication to use the dataset, we'll be presented with the dataset preview. The dataset preview
shows us the first 100 samples of the dataset for each language. What's more, it's loaded up with audio samples ready for us
to listen to in real time. For this Unit, we'll select [_Dhivehi_](https://en.wikipedia.org/wiki/Maldivian_language)
(or _Maldivian_), an Indo-Aryan language spoken in the South Asian island country of the Maldives. While we're selecting
Dhivehi for this tutorial, the steps covered here apply to any one of the 108 languages in the Common Voice 13 dataset, and
more generally to any one of the 180+ audio datasets on the Hugging Face Hub, so there's no restriction on language or dialect.

We can select the Dhivehi subset of Common Voice 13 by setting the subset to `dv` using the dropdown menu (`dv` being the language
identifier code for Dhivehi):

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface-course/audio-course-images/resolve/main/cv_13_dv_selection.png" alt="Selecting the Dhivehi split from the Dataset's Preview">
</div>

If we hit the play button on the first sample, we can listen to the audio and see the corresponding text. Have a scroll
through the samples for the train and test sets to get a better feel for the audio and text data that we're dealing with.
You can tell from the intonation and style that the recordings are taken from narrated speech. You'll also likely notice
the large variation in speakers and recording quality, a common trait of crowd-sourced data.

The Dataset Preview is a brilliant way of experiencing audio datasets before committing to using them. You can pick any
dataset on the Hub, scroll through the samples and listen to the audio for the different subsets and splits, gauging whether
it's the right dataset for your needs. Once you've selected a dataset, it's trivial to load the data so that you can
start using it.
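
For instance, the Dhivehi subset we've just previewed can be loaded in a couple of lines with 🤗 Datasets. The following is just a quick sketch (the fine-tuning section walks through loading properly); here `streaming=True` simply avoids downloading the full dataset up front:

```python
from datasets import load_dataset

# load the Dhivehi ("dv") subset of Common Voice 13 in streaming mode
common_voice = load_dataset(
    "mozilla-foundation/common_voice_13_0",
    "dv",
    split="train",
    streaming=True,
)

# inspect the first sample: Common Voice stores the target transcription in the "sentence" column
sample = next(iter(common_voice))
print(sample["sentence"])
```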

Now, I personally don't speak Dhivehi, and expect the vast majority of readers not to either! To know if our fine-tuned model
is any good, we'll need a rigorous way of _evaluating_ it on unseen data and measuring its transcription accuracy.
We'll cover exactly this in the next section!
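
As a quick preview, the main metric we'll work with there is the word error rate (WER), which 🤗 Evaluate provides out of the box. A tiny sketch with made-up strings:

```python
import evaluate

wer_metric = evaluate.load("wer")

# one substituted word out of three reference words gives a WER of roughly 0.33
wer = wer_metric.compute(
    predictions=["transcribe this video"],
    references=["transcribe this audio"],
)
print(wer)
```
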
89 changes: 89 additions & 0 deletions chapters/en/chapter5/demo.mdx
@@ -0,0 +1,89 @@
# Build a demo with Gradio

Now that we've fine-tuned a Whisper model for Dhivehi speech recognition, let's go ahead and build a [Gradio](https://gradio.app)
demo to showcase it to the community!

The first thing to do is load up the fine-tuned checkpoint using the `pipeline()` function - this should be very familiar by now from
the section on [pre-trained models](asr_models.mdx). You can change the `model_id` to the namespace of your fine-tuned
model on the Hugging Face Hub, or one of the pre-trained [Whisper models](https://huggingface.co/models?sort=downloads&search=openai%2Fwhisper-)
to perform zero-shot speech recognition:

```python
from transformers import pipeline

model_id = "sanchit-gandhi/whisper-small-dv" # update with your model id
pipe = pipeline("automatic-speech-recognition", model=model_id)
```

Secondly, we'll define a function that takes the filepath for an audio input and passes it through the pipeline. Here,
the pipeline automatically takes care of loading the audio file, resampling it to the correct sampling rate, and running
inference with the model. We can then simply return the transcribed text as the output of the function. To ensure our
model can handle audio inputs of arbitrary length, we'll enable *chunking* as described in the section
on [pre-trained models](asr_models.mdx):

```python
def transcribe_speech(filepath):
output = pipe(
filepath,
max_new_tokens=256,
generate_kwargs={
"task": "transcribe",
"language": "sinhalese",
}, # update with the language you've fine-tuned on
chunk_length_s=30,
batch_size=8,
)
return output["text"]
```

We'll use the Gradio [blocks](https://gradio.app/docs/#blocks) feature to launch two tabs on our demo: one for microphone
transcription, and the other for file upload.

```python
import gradio as gr

demo = gr.Blocks()

mic_transcribe = gr.Interface(
fn=transcribe_speech,
inputs=gr.Audio(source="microphone", type="filepath"),
outputs=gr.outputs.Textbox(),
)

file_transcribe = gr.Interface(
fn=transcribe_speech,
inputs=gr.Audio(source="upload", type="filepath"),
outputs=gr.outputs.Textbox(),
)
```

Finally, we launch the Gradio demo using the two blocks that we've just defined:

```python
with demo:
gr.TabbedInterface(
[mic_transcribe, file_transcribe],
["Transcribe Microphone", "Transcribe Audio File"],
)

demo.launch(debug=True)
```

This will launch a Gradio demo similar to the one running on the Hugging Face Space:

<iframe src="https://course-demos-whisper-small.hf.space" frameBorder="0" height="450" title="Gradio app" class="container p-0 flex-grow space-iframe" allow="accelerometer; ambient-light-sensor; autoplay; battery; camera; document-domain; encrypted-media; fullscreen; geolocation; gyroscope; layout-animations; legacy-image-formats; magnetometer; microphone; midi; oversized-images; payment; picture-in-picture; publickey-credentials-get; sync-xhr; usb; vr ; wake-lock; xr-spatial-tracking" sandbox="allow-forms allow-modals allow-popups allow-popups-to-escape-sandbox allow-same-origin allow-scripts allow-downloads"></iframe>

Should you wish to host your demo on the Hugging Face Hub, you can use this Space as a template for your fine-tuned model.

Click the link to duplicate the template demo to your account: https://huggingface.co/spaces/course-demos/whisper-small?duplicate=true

We recommend giving your Space a name similar to your fine-tuned model (e.g. whisper-small-dv-demo) and setting the visibility to "Public".

Once you've duplicated the Space to your account, click "Files and versions" -> "app.py" -> "edit". Then change the
model identifier to your fine-tuned model (line 6). Scroll to the bottom of the page and click "Commit changes to main".
The demo will reboot, this time using your fine-tuned model. You can share this demo with your friends and family so that
they can use the model that you've trained!

Check out our video tutorial to get a better understanding of how to duplicate the Space 👉️ [YouTube Video](https://www.youtube.com/watch?v=VQYuvl6-9VE)

We look forward to seeing your demos on the Hub!