
unclear librispeech data prepare scripts for owsm_v1/s2t1 #5686

Open
anonymoussky opened this issue Feb 29, 2024 · 3 comments
Labels
Bug (bug should be fixed), OWSM (Open Whisper-style Speech Model)

Comments

@anonymoussky

anonymoussky commented Feb 29, 2024

There are several unclear lines of code in espnet/egs2/owsm_v1/s2t1/local/prepare_librispeech.py.
Is there a more accurate script for preparing LibriSpeech for owsm_v1 training?

FileNotFoundError: [Errno 2] No such file or directory: '/espnet/egs2/librispeech/asr1/downloads/mp3/1272/135031/1272-135031.sents.seg.txt'
Preparing LibriSpeech failed.

For example:

  1. The original LibriSpeech corpus is distributed in "flac" format; should we convert it to "mp3" first?
  2. {speaker}-{chapter.name}.sents.seg.txt does not exist in the original LibriSpeech corpus.
  3. "{speaker}-{chapter.name}.sents.trans.txt" also does not match the transcript naming in the original corpus, e.g., "1272-128104.trans.txt".

The relevant excerpt from prepare_librispeech.py:
'''
for chapter in (data_dir / "mp3" / speaker).iterdir():
    if chapter.is_dir():
        utts = []
        audio = str((chapter / f"{chapter.name}.mp3").resolve())
        with open(
            chapter / f"{speaker}-{chapter.name}.sents.seg.txt", "r"
        ) as seg_f, open(
            chapter / f"{speaker}-{chapter.name}.sents.trans.txt", "r"
'''
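
For reference, a rough check of what the script expects versus what the standard segmented (flac) LibriSpeech download contains; the paths are from my setup and the speaker/chapter IDs are just the ones from the error above:

'''
from pathlib import Path

# Files that prepare_librispeech.py expects for one speaker/chapter
# (original-mp3 layout) versus what the segmented release provides.
data_dir = Path("/espnet/egs2/librispeech/asr1/downloads")
speaker, chapter = "1272", "135031"

expected = [
    data_dir / "mp3" / speaker / chapter / f"{chapter}.mp3",
    data_dir / "mp3" / speaker / chapter / f"{speaker}-{chapter}.sents.seg.txt",
    data_dir / "mp3" / speaker / chapter / f"{speaker}-{chapter}.sents.trans.txt",
]
for path in expected:
    print(f"{path}: {'found' if path.exists() else 'MISSING'}")

# What the segmented release (dev-clean) provides for the same chapter:
# only *.flac files plus 1272-135031.trans.txt, no *.sents.seg.txt.
seg_dir = data_dir / "LibriSpeech" / "dev-clean" / speaker / chapter
if seg_dir.is_dir():
    print(sorted(p.name for p in seg_dir.iterdir()))
'''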
anonymoussky added the Bug (bug should be fixed) label on Feb 29, 2024
sw005320 added the OWSM (Open Whisper-style Speech Model) label on Feb 29, 2024
@sw005320
Contributor

@pyf98, can you answer it?

@pyf98
Collaborator

pyf98 commented Feb 29, 2024

Hi, thanks for the question!

For LibriSpeech, I did not use the standard segmented version. Instead, I used the "original-mp3" subset, which I believe is released along with the segmented version. You might need to check the original source of the LibriSpeech distribution.

Here are the relevant paragraphs from the LibriSpeech README:

2. Structure
============

The corpus is split into several parts to enable users to selectively download
subsets of it, according to their needs. The subsets with "clean" in their name
are supposedly "cleaner"(at least on average), than the rest of the audio and
US English accented. That classification was obtained using very crude automated
means, and should not be considered completely reliable. The subsets are
disjoint, i.e. the audio of each speaker is assigned to exactly one subset.

The parts of the corpus are as follows:

* dev-clean, test-clean - development and test set containing "clean" speech.
* train-clean-100 - training set, of approximately 100 hours of "clean" speech
* train-clean-360 - training set, of approximately 360 hours of "clean" speech
* dev-other, test-other - development and test set, with speech which was
                          automatically selected to be more "challenging" to
                          recognize
* train-other-500 - training set of approximately 500 hours containing speech
                    that was not classified as "clean", for some (possibly wrong)
                    reason
* intro - subset containing only the LibriVox's intro disclaimers for some of the
          readers.
* mp3 - the original MP3-encoded audio on which the corpus is based
* texts - the original Project Gutenberg texts on which the reference transcripts
          for the utterances in the corpus are based.
* raw_metadata - SQLite databases which record various pieces of information about
                 the source text/audio materials used, and the alignment process.
                 (mostly for completeness - probably not very interesting or useful)


2.3 Organization of the "original-mp3" subset
---------------------------------------------

This part contains the original MP3-compressed recordings as downloaded from the
Internet Archive. It is intended to serve as a secure reference "snapshot" for
the original audio chapters, but also to preserve (most of) the information both
about audio, selected for the corpus, and audio that was discarded. I decided to
try make the corpus relatively balanced in terms of per-speaker durations, so
part of the audio available for some of the speakers was discarded. Also for the
speakers in the training sets, only up to 10 minutes of audio is used, to
introduce more speaker diversity during evaluation time. There should be enough
information in the "mp3" subset to enable the re-cutting of an extended
"LibriSpeech+" corpus, containing around 150 extra hours of speech, if needed.

The directory hierarchy follows the already familiar pattern. In each
speaker directory there is a file named "utterance_map" which list for each
of the utterances in the corpus, the original "raw" aligned utterance.
In the "header" of that file there are also 2 lines, that show if the
sentence-aware segmentation was used in the LibriSpeech corpus(i.e. if the
reader is assigned to a test set) and the maximum allowed duration for
the set to which this speaker was assigned.

Then in the chapter directory, besides the original audio chapter .mp3 file,
there are two sets of ".seg.txt" and ".trans.txt" files. The former contain
the time range(in seconds) for each of the original(that I called "raw" above)
utterances. The latter contains the respective transcriptions. There are two
sets for the two possible segmentations of each chapter. The ".sents"
segmentation is "sentence-aware", that is, we only split on silence intervals
coinciding with (automatically obtained) sentence boundaries in the text.
The other segmentation was derived by allowing splitting on every silence
interval longer than 300ms, which leads to better utilization of the aligned
audio.
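
Based on the README above, here is a rough sketch of how the chapter mp3 and the ".sents" seg/trans files can be paired. This is a simplification rather than the exact code in prepare_librispeech.py, and the per-line column layout of the seg/trans files is an assumption you should verify against your copy of the original-mp3 release:

'''
from pathlib import Path

# Simplified traversal of the original-mp3 layout: downloads/mp3/<speaker>/<chapter>/
# Each chapter directory holds <chapter>.mp3 plus <speaker>-<chapter>.sents.seg.txt
# (time ranges in seconds) and <speaker>-<chapter>.sents.trans.txt (transcriptions).
data_dir = Path("downloads")

for speaker_dir in sorted((data_dir / "mp3").iterdir()):
    speaker = speaker_dir.name
    for chapter in sorted(speaker_dir.iterdir()):
        if not chapter.is_dir():
            continue
        audio = chapter / f"{chapter.name}.mp3"
        seg_file = chapter / f"{speaker}-{chapter.name}.sents.seg.txt"
        trans_file = chapter / f"{speaker}-{chapter.name}.sents.trans.txt"
        if not (audio.exists() and seg_file.exists() and trans_file.exists()):
            # Not the original-mp3 layout (e.g. the segmented flac release) -> skip.
            continue
        with open(seg_file) as seg_f, open(trans_file) as trans_f:
            # Assumption: the two files are line-aligned, one "raw" utterance per line;
            # seg lines carry the time range in seconds, trans lines the transcription.
            for seg_line, trans_line in zip(seg_f, trans_f):
                print(chapter.name, "|", seg_line.strip(), "|", trans_line.strip())
'''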

@anonymoussky
Author

thanks
