Remove Silence in Batched transcription #1297
base: master
Conversation
Pull Request Overview
This PR removes unwanted silence in batched transcription by consolidating speech chunks while excluding silence between segments. Key changes include updating the VAD chunk collection API with a new max_duration parameter and metadata keys ("offset" and "duration"), removing the deprecated merge_segments function, and adjusting transcription logic to use the new chunk metadata format.
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| tests/test_transcribe.py | Updated expected segment count from 7 to 6 to reflect the reduced output due to silence removal. |
| faster_whisper/vad.py | Modified the collect_chunks API to use a max_duration parameter and updated metadata keys; removed merge_segments. |
| faster_whisper/transcribe.py | Adjusted transcription logic to use the new "offset" and "duration" metadata and removed the merge_segments call. |
Comments suppressed due to low confidence (2)
faster_whisper/transcribe.py:127
- The transition from 'start_time' to 'offset' and the use of 'duration' improve consistency in chunk metadata; please ensure that all downstream time-based calculations have been updated accordingly.
duration = chunk_metadata["duration"]
faster_whisper/transcribe.py:411
- The removal of the merge_segments call changes the segmentation behavior; confirm that directly using get_speech_timestamps produces the intended speech chunk boundaries.
clip_timestamps = get_speech_timestamps(audio, vad_parameters)
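As a rough, purely illustrative sketch of the packing behaviour described in the overview (the function name, signature, and metadata layout below are assumptions inferred from this page, not the PR's actual code), voiced regions could be consolidated into silence-free chunks like this:

```python
# Hypothetical sketch of max_duration-based chunk packing; names and the
# metadata layout are assumed from the PR summary, not copied from the diff.
from typing import Dict, List, Tuple

import numpy as np


def pack_speech_chunks(
    audio: np.ndarray,
    speech_timestamps: List[Dict[str, int]],  # [{"start": sample, "end": sample}, ...]
    sampling_rate: int = 16000,
    max_duration: float = 30.0,
) -> Tuple[List[np.ndarray], List[Dict[str, float]]]:
    """Concatenate voiced regions into chunks of at most max_duration seconds,
    dropping the silence between them, and return per-chunk metadata using the
    "offset" and "duration" keys mentioned in the overview."""
    max_samples = int(max_duration * sampling_rate)
    chunks, chunks_metadata = [], []
    parts, n_samples, offset = [], 0, None

    for ts in speech_timestamps:
        region = audio[ts["start"] : ts["end"]]
        # Close the current chunk if adding this region would exceed max_duration.
        if parts and n_samples + len(region) > max_samples:
            chunks.append(np.concatenate(parts))
            chunks_metadata.append(
                {"offset": offset, "duration": n_samples / sampling_rate}
            )
            parts, n_samples, offset = [], 0, None
        if offset is None:
            offset = ts["start"] / sampling_rate  # chunk start in the original audio (s)
        parts.append(region)
        n_samples += len(region)

    if parts:
        chunks.append(np.concatenate(parts))
        chunks_metadata.append(
            {"offset": offset, "duration": n_samples / sampling_rate}
        )
    return chunks, chunks_metadata
```

Each packed chunk can then be padded to 30 s and batched for the encoder, with the metadata used afterwards to place predicted timestamps back in the original audio.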
@Purfview I'd appreciate your review and testing
@MahmoudAshraf97 Indeed, it can reduce the hallucinations. However, if each voiced region returned by VAD is treated as a single segment and padded to 30 sec for Whisper, we are increasing the number of such segments and effectively not packing multiple segments into a single 30-second GPU input. This can ruin efficiency. We can adjust the merging logic to ensure chunks are filled with as many voiced regions as possible without exceeding the 30-second limit. We will have to maintain a mapping of original segment indices and their timings for later retrieval. This way, we make sure that multiple segments are kept in a single input (30 sec) to the Whisper encoder while still avoiding the current hallucinations.
Hello @Jiltseb, this is already the case: we pack regions until the total duration reaches 30s, and after inference we map the predicted timings back to the original ones.
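For illustration, a minimal sketch of the timestamp bookkeeping this implies (the helper below is hypothetical, not the library's API): keep the voiced regions that were concatenated into each chunk, then shift any time predicted inside the silence-free chunk back into the original audio.

```python
# Illustrative only: restore original timings from a time inside a packed,
# silence-free chunk, given the voiced regions that were concatenated into it.
from typing import List, Tuple


def to_original_time(chunk_time: float, regions: List[Tuple[float, float]]) -> float:
    """regions are (start, end) times in seconds of the voiced segments,
    in the order they were packed into the chunk."""
    elapsed = 0.0
    for start, end in regions:
        length = end - start
        if chunk_time <= elapsed + length:
            return start + (chunk_time - elapsed)
        elapsed += length
    # Predicted time falls past the packed speech: clamp to the last region's end.
    return regions[-1][1]


# A word predicted 5.2 s into a chunk built from speech at 0.5-4.0 s and 12.0-15.0 s
# maps back to roughly 13.7 s in the original audio.
print(to_original_time(5.2, [(0.5, 4.0), (12.0, 15.0)]))  # ~13.7
```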
👍 LGTM!
The current VAD implementation in batched transcription is only used for segmentation, and silence is only removed at segment boundaries. For example, if we have a speech segment from 1 to 3 and another from 9 to 10, the resulting segment will span 1 to 10, including a large silence period from 3 to 9 that is prone to hallucinations.
This PR concatenates only speech until the desired chunk size is reached, ignoring the silence in between, thus reducing hallucinations in batched mode.
Since the speech is condensed, more words per segment should be expected in the transcript; a slight speedup is also expected for long files due to fewer segments.
TODO: check whether the default VAD settings for batched mode are still valid or need more forgiving values.
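To put numbers on the 1-3 s / 9-10 s example above (purely illustrative arithmetic, not code from the PR):

```python
# Purely illustrative arithmetic for the 1-3 s / 9-10 s example above.
speech_regions = [(1.0, 3.0), (9.0, 10.0)]

# Previous behaviour: silence removed only at the outer boundaries,
# so the segment handed to Whisper spans the full 1-10 s range.
old_duration = speech_regions[-1][1] - speech_regions[0][0]            # 9.0 s
silence_inside = old_duration - sum(e - s for s, e in speech_regions)  # 6.0 s

# New behaviour: only the voiced audio is concatenated into the chunk.
new_duration = sum(e - s for s, e in speech_regions)                   # 3.0 s

print(old_duration, silence_inside, new_duration)  # 9.0 6.0 3.0
```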