Remove Silence in Batched transcription #1297

Open

wants to merge 1 commit into master

Conversation

MahmoudAshraf97
Collaborator

The current VAD implementation in batched transcription is used only for segmentation, so silence is removed only at segment boundaries. For example, if we have a speech segment from 1 s to 3 s and another from 9 s to 10 s, the resulting segment spans from 1 s to 10 s and includes a long silence from 3 s to 9 s that is prone to hallucinations.

This PR concatenates only speech until the desired chunk size is reached, ignoring the silence in between, which reduces hallucinations in batched mode. A minimal sketch of the packing idea is shown below.
Since the speech is condensed, more words per segment should be expected in the transcript; a slight speedup is also possible for long files because fewer segments are produced.
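
Illustrative only, not the code in this PR: a minimal, self-contained sketch of a greedy packing step that fills each chunk with speech regions until adding the next region would exceed the chunk size (a region longer than the limit is kept whole here for simplicity).

```python
def pack_speech_regions(regions, max_duration=30.0):
    """regions: list of (start, end) tuples in seconds, sorted by start time."""
    chunks = []
    current = []
    current_duration = 0.0
    for start, end in regions:
        region_duration = end - start
        # Start a new chunk if adding this region would exceed max_duration.
        if current and current_duration + region_duration > max_duration:
            chunks.append(current)
            current, current_duration = [], 0.0
        current.append((start, end))
        current_duration += region_duration
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    # Speech at 1-3 s and 9-10 s: previously the transcribed segment spanned
    # 1-10 s (including 6 s of silence); here the chunk holds only 3 s of speech.
    print(pack_speech_regions([(1.0, 3.0), (9.0, 10.0)]))
    # -> [[(1.0, 3.0), (9.0, 10.0)]]
```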

TODO: check whether the default VAD settings for batched mode are still valid or need more forgiving values.

MahmoudAshraf97 requested a review from Copilot on May 10, 2025 at 20:43

Copilot AI left a comment

Pull Request Overview

This PR removes unwanted silence in batched transcription by consolidating speech chunks while excluding silence between segments. Key changes include updating the VAD chunk collection API with a new max_duration parameter and metadata keys ("offset" and "duration"), removing the deprecated merge_segments function, and adjusting transcription logic to use the new chunk metadata format.
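
As an illustration of the metadata format described above, here is a minimal sketch of a chunk-collection step. The function name condense_chunks, the 16 kHz sampling rate, and the overall structure are assumptions made for this sketch rather than the actual faster_whisper/vad.py code; the "offset" and "duration" keys follow the PR description.

```python
import numpy as np

SAMPLING_RATE = 16000  # assumption: 16 kHz audio, as used by Whisper models


def condense_chunks(audio, packed_groups):
    """
    audio: 1-D float array of the full file.
    packed_groups: list of lists of (start_s, end_s) speech regions, each group
        already limited to the desired chunk duration by a packing step
        (e.g. like the one sketched in the PR description above).
    Returns the condensed audio chunks and per-chunk metadata.
    """
    chunks, metadata = [], []
    for group in packed_groups:
        pieces = [
            audio[int(start * SAMPLING_RATE): int(end * SAMPLING_RATE)]
            for start, end in group
        ]
        chunk = np.concatenate(pieces)  # silence between regions is dropped
        chunks.append(chunk)
        metadata.append(
            {
                "offset": group[0][0],                       # chunk start in the original audio (s)
                "duration": chunk.shape[0] / SAMPLING_RATE,  # speech packed into the chunk (s)
            }
        )
    return chunks, metadata
```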

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File | Description
tests/test_transcribe.py | Updated expected segment count from 7 to 6 to reflect the reduced output due to silence removal.
faster_whisper/vad.py | Modified collect_chunks API to use a max_duration parameter and updated metadata keys; removed merge_segments.
faster_whisper/transcribe.py | Adjusted transcription logic to use new "offset" and "duration" metadata and removed merge_segments call.
Comments suppressed due to low confidence (2)

faster_whisper/transcribe.py:127

  • The transition from 'start_time' to 'offset' and using 'duration' improves consistency in chunk metadata; please ensure that all downstream time-based calculations have been updated accordingly.
duration = chunk_metadata["duration"]

faster_whisper/transcribe.py:411

  • The removal of the merge_segments call changes the segmentation behavior; confirm that directly using get_speech_timestamps produces the intended speech chunk boundaries.
clip_timestamps = get_speech_timestamps(audio, vad_parameters)

@MahmoudAshraf97
Collaborator Author

@Purfview I'd appreciate your review and testing

@Jiltseb
Contributor

Jiltseb commented May 12, 2025

@MahmoudAshraf97 Indeed, it can reduce the hallucinations. However, if each voiced region returned by the VAD is treated as a single segment and padded to 30 seconds for Whisper, we are effectively increasing the number of such segments and not packing multiple segments into a single GPU input. This can ruin efficiency.

We can adjust the merging logic to ensure chunks are filled with as many voiced regions as possible without exceeding the 30-second limit. We will have to maintain a mapping of the original segment indices and their timings for later retrieval. This way, we make sure that multiple segments are kept in a single 30-second input to the Whisper encoder while still avoiding the current hallucinations.
@MahmoudAshraf97 Thoughts?

@MahmoudAshraf97
Collaborator Author

Hello @Jiltseb, this is already the case: we pack regions until the total duration reaches 30 s, and after inference we map the predicted timings back to the original ones.
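
A minimal sketch of that mapping step, assuming each packed chunk keeps, for every speech region it contains, the region's position inside the condensed chunk and its original start time; the names below are illustrative, not the actual faster_whisper code.

```python
def restore_timestamp(t_condensed, regions):
    """
    t_condensed: time (s) predicted by the model, relative to the condensed chunk.
    regions: per-region mapping info, each a dict with
        "condensed_start": where the region begins inside the condensed chunk (s)
        "original_start":  where the region begins in the original audio (s)
        "duration":        length of the region (s)
    Returns the corresponding time in the original audio.
    """
    for region in regions:
        start = region["condensed_start"]
        if start <= t_condensed <= start + region["duration"]:
            return region["original_start"] + (t_condensed - start)
    # If the prediction overshoots the packed speech, clamp to the last region's end.
    last = regions[-1]
    return last["original_start"] + last["duration"]


if __name__ == "__main__":
    # Speech at 1-3 s and 9-10 s packed into one 3 s condensed chunk:
    regions = [
        {"condensed_start": 0.0, "original_start": 1.0, "duration": 2.0},
        {"condensed_start": 2.0, "original_start": 9.0, "duration": 1.0},
    ]
    print(restore_timestamp(2.5, regions))  # -> 9.5
```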

Jiltseb (Contributor) left a comment

👍 LGTM!
