Problems in concatenate_dataset #129

George0828Zhang · 2024-05-01T02:51:47Z

In concatenate_dataset():

distil-whisper/training/run_pseudo_labelling.py

Lines 644 to 671 in 66ac8dd

    for idx in range(1, len(audio)):
        prev_speaker = speaker_id[idx - 1]
        speaker = speaker_id[idx]

        if len(audio_sample) + input_lengths[idx] < max_input_length:
            if speaker == prev_speaker:
                # we have no information about whether the segments follow on sequentially
                # so we just ensure the same speaker as we concatenate across files
                audio_sample = np.append(audio_sample, audio[idx])
                # extra spaces in the text transcription don't matter, since we only use it for the WER computation
                text_sample += " " + text[idx]
            else:
                # speakers do not follow sequentially, save the audio and start looping again
                concatenated_audio.append(audio_sample)
                concatenated_text.append(text_sample)
                concatenated_speaker.append(speaker)
                condition_on_prev.append(0)
                audio_sample = audio[idx]
                text_sample = text[idx]

        else:
            # concatenated audio exceeds max length, save the audio and start looping again
            concatenated_audio.append(audio_sample)
            concatenated_text.append(text_sample)
            concatenated_speaker.append(speaker)
            condition_on_prev.append(1)
            audio_sample = audio[idx]
            text_sample = text[idx]

From my understanding, the logic in the for loop is:

If either:
1. Adding the current utterance to audio_sample would reach or exceed the maximum length (30s)
2. The current speaker (speaker) is different from the previous one (prev_speaker)
Then save the concatenation up to the previous utterance (audio_sample), excluding the current utterance.

Since the concatenated sample does not contain the current utterance, we have:

1. The appended speaker should be prev_speaker rather than speaker.
2. condition_on_prev signifies continuity at the start of the current utterance, so it should be shifted to the right by 1 (e.g. initialize as condition_on_prev = [0]).

Meanwhile, it seems that the very last accumulated sample in each batch never gets appended, i.e. when the for loop exits, there is still an (audio_sample, text_sample) pair of <= 30s that should have been appended but was not.

These may not seem significant, but when fine-tuning on a custom dataset with diverse speakers, where condition_on_prev is expected to be true a lot, they will produce incorrect training signals.
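
For illustration, here is a minimal sketch of the loop with the three changes above applied. It is not the code merged in the repository: the function name concatenate_segments, its signature, and the returned dict keys are hypothetical, and the initialization from the first element of the batch is an assumption about the code surrounding the quoted lines. The branch structure is otherwise kept as in the quoted snippet.

```python
import numpy as np


def concatenate_segments(audio, text, speaker_id, max_input_length):
    """Hypothetical standalone version of the quoted loop (illustration only)."""
    concatenated_audio, concatenated_text, concatenated_speaker = [], [], []
    if len(audio) == 0:
        return {"audio": [], "text": [], "speaker_id": [], "condition_on_prev": []}

    # Change 2: each flag describes the start of a saved sample, and the first
    # sample of a batch has no known predecessor, so the list starts at [0].
    condition_on_prev = [0]

    # Assumption: accumulation starts from the first utterance of the batch.
    audio_sample = np.asarray(audio[0])
    text_sample = text[0]

    for idx in range(1, len(audio)):
        prev_speaker = speaker_id[idx - 1]
        speaker = speaker_id[idx]

        if len(audio_sample) + len(audio[idx]) < max_input_length:
            if speaker == prev_speaker:
                # Same speaker and still within the limit: keep accumulating.
                audio_sample = np.append(audio_sample, audio[idx])
                text_sample += " " + text[idx]
                continue
            next_flag = 0  # split caused by a speaker change
        else:
            next_flag = 1  # split caused by the length limit

        # The saved sample ends at utterance idx - 1, so it belongs to
        # prev_speaker (change 1), while next_flag belongs to the sample
        # that starts at utterance idx.
        concatenated_audio.append(audio_sample)
        concatenated_text.append(text_sample)
        concatenated_speaker.append(prev_speaker)
        condition_on_prev.append(next_flag)
        audio_sample = np.asarray(audio[idx])
        text_sample = text[idx]

    # Change 3: flush the sample that is still being accumulated at loop exit.
    concatenated_audio.append(audio_sample)
    concatenated_text.append(text_sample)
    concatenated_speaker.append(speaker_id[-1])

    return {
        "audio": concatenated_audio,
        "text": concatenated_text,
        "speaker_id": concatenated_speaker,
        "condition_on_prev": condition_on_prev,
    }
```

With this layout the four output lists always have the same length, and entry i of condition_on_prev refers to the start of saved sample i rather than to the sample that follows it.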


sanchit-gandhi linked a pull request Jun 12, 2024 that will close this issue

[pseudo-labelling] fix concatenate datasets #138

Merged

eustlb closed this as completed in #138 Jun 12, 2024
