Add GigaSpeech 2 recipe #1365

yfyeung · 2024-06-28T09:00:29Z

This PR adds a recipe for GigaSpeech 2.
GigaSpeech 2 raw comprises about 30,000 hours of automatically transcribed speech across Thai, Indonesian, and Vietnamese. GigaSpeech 2 refined consists of 10,000 hours of Thai and 6,000 hours each of Indonesian and Vietnamese. The GigaSpeech 2 test sets aim to reflect realistic speech recognition scenarios and to mirror the real-world performance of ASR systems on low-resource languages.

For more details, please visit:
Dataset: https://huggingface.co/datasets/speechcolab/gigaspeech2
Preprint paper: https://arxiv.org/pdf/2406.11546

pzelasko

Thanks!! The recipe looks good to me, although I have one suggestion. If you could reuse the streaming manifest writing mechanism from the GigaSpeech 1 recipe, it would allow users to prepare this dataset with minimal memory usage. As-is, the recipe needs a lot of CPU memory because it holds the entire manifest in memory before writing it to disk. See:

lhotse/lhotse/recipes/gigaspeech.py

Lines 96 to 129 in da4d70d

    with RecordingSet.open_writer(
        output_dir / f"gigaspeech_recordings_{part}.jsonl.gz"
    ) as rec_writer, SupervisionSet.open_writer(
        output_dir / f"gigaspeech_supervisions_{part}.jsonl.gz"
    ) as sup_writer, CutSet.open_writer(
        output_dir / f"gigaspeech_cuts_{part}.jsonl.gz"
    ) as cut_writer:
        for recording, segments in tqdm(
            parallel_map(
                parse_utterance,
                gigaspeech.audios("{" + part + "}"),
                repeat(gigaspeech.gigaspeech_dataset_dir),
                num_jobs=num_jobs,
            ),
            desc="Processing GigaSpeech JSON entries",
        ):
            # Fix and validate the recording + supervisions
            recordings, segments = fix_manifests(
                recordings=RecordingSet.from_recordings([recording]),
                supervisions=SupervisionSet.from_segments(segments),
            )
            validate_recordings_and_supervisions(
                recordings=recordings, supervisions=segments
            )
            # Create the cut since most users will need it anyway.
            # There will be exactly one cut since there's exactly one recording.
            cuts = CutSet.from_manifests(
                recordings=recordings, supervisions=segments
            )
            # Write the manifests
            rec_writer.write(recordings[0])
            for s in segments:
                sup_writer.write(s)
            cut_writer.write(cuts[0])
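
For readers who have not seen the pattern before, the memory argument can be illustrated with a small stdlib-only sketch. It is not lhotse's implementation, and the names `iter_segments`, `write_jsonl_gz_streaming`, and `read_jsonl_gz` are hypothetical helpers made up for this example. The point is only that each entry is serialized and written as soon as it is produced, so the writer never holds more than one entry at a time.

```python
import gzip
import json
from pathlib import Path
from typing import Dict, Iterable, Iterator


def iter_segments(num_items: int) -> Iterator[Dict]:
    """Hypothetical stand-in for a parser that yields one manifest entry at a time."""
    for i in range(num_items):
        yield {"id": f"utt-{i}", "start": 0.0, "duration": 1.5, "text": f"text {i}"}


def write_jsonl_gz_streaming(items: Iterable[Dict], path: Path) -> int:
    """Write each item as one JSON line as soon as it arrives.

    Because `items` can be a generator, the full collection is never
    materialized as a list in memory. Returns the number of items written.
    """
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_jsonl_gz(path: Path) -> Iterator[Dict]:
    """Lazily read the entries back, again one line at a time."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)
```

Passing a generator such as `iter_segments(n)` straight into `write_jsonl_gz_streaming` keeps peak memory roughly independent of `n`, whereas building a list of all entries first would make it grow with the dataset size. The lhotse writers quoted above play the same role for recordings, supervisions, and cuts.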

yfyeung · 2024-07-03T16:32:02Z

Sure, I will implement this later.

yfyeung added 4 commits June 28, 2024 01:54

add recipe for gigaspeech2

be42a33

fix flake8 and isort

2cc34ca

remove comments

1d721b4

small fix for train_raw & train_refined

83aef17

pzelasko reviewed Jul 3, 2024

pzelasko added this to the v1.25.0 milestone Jul 3, 2024

Merge branch 'lhotse-speech:master' into master

63dce67
