Remove Silence in Batched transcription #1297

Open

wants to merge 1 commit into master

Conversation

MahmoudAshraf97
Collaborator

The current VAD implementation in batched transcription is used only for segmentation, so silence is removed only at segment boundaries. For example, if we have a speech segment from 1 s to 3 s and another from 9 s to 10 s, the resulting segment spans from 1 s to 10 s and includes a long silence from 3 s to 9 s that is prone to hallucinations.

This PR concatenates only speech until the desired chunk size is reached, ignoring the silence in between, which reduces hallucinations in batched mode. A minimal sketch of the packing idea is shown below.
Since the speech is condensed, more words per segment should be expected in the transcript; a slight speedup is also possible for long files because fewer segments are produced.
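
Illustrative only, not the code in this PR: a minimal, self-contained sketch of a greedy packing step that fills each chunk with speech regions until adding the next region would exceed the chunk size (a region longer than the limit is kept whole here for simplicity).

```python
def pack_speech_regions(regions, max_duration=30.0):
    """regions: list of (start, end) tuples in seconds, sorted by start time."""
    chunks = []
    current = []
    current_duration = 0.0
    for start, end in regions:
        region_duration = end - start
        # Start a new chunk if adding this region would exceed max_duration.
        if current and current_duration + region_duration > max_duration:
            chunks.append(current)
            current, current_duration = [], 0.0
        current.append((start, end))
        current_duration += region_duration
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    # Speech at 1-3 s and 9-10 s: previously the transcribed segment spanned
    # 1-10 s (including 6 s of silence); here the chunk holds only 3 s of speech.
    print(pack_speech_regions([(1.0, 3.0), (9.0, 10.0)]))
    # -> [[(1.0, 3.0), (9.0, 10.0)]]
```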

TODO: check whether the default VAD settings for batched mode are still valid or need more forgiving values.

MahmoudAshraf97 requested a review from Copilot on May 10, 2025 at 20:43

Copilot AI left a comment

Pull Request Overview

This PR removes unwanted silence in batched transcription by consolidating speech chunks while excluding silence between segments. Key changes include updating the VAD chunk collection API with a new max_duration parameter and metadata keys ("offset" and "duration"), removing the deprecated merge_segments function, and adjusting transcription logic to use the new chunk metadata format.
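
As an illustration of the metadata format described above, here is a minimal sketch of a chunk-collection step. The function name condense_chunks, the 16 kHz sampling rate, and the overall structure are assumptions made for this sketch rather than the actual faster_whisper/vad.py code; the "offset" and "duration" keys follow the PR description.

```python
import numpy as np

SAMPLING_RATE = 16000  # assumption: 16 kHz audio, as used by Whisper models


def condense_chunks(audio, packed_groups):
    """
    audio: 1-D float array of the full file.
    packed_groups: list of lists of (start_s, end_s) speech regions, each group
        already limited to the desired chunk duration by a packing step
        (e.g. like the one sketched in the PR description above).
    Returns the condensed audio chunks and per-chunk metadata.
    """
    chunks, metadata = [], []
    for group in packed_groups:
        pieces = [
            audio[int(start * SAMPLING_RATE): int(end * SAMPLING_RATE)]
            for start, end in group
        ]
        chunk = np.concatenate(pieces)  # silence between regions is dropped
        chunks.append(chunk)
        metadata.append(
            {
                "offset": group[0][0],                       # chunk start in the original audio (s)
                "duration": chunk.shape[0] / SAMPLING_RATE,  # speech packed into the chunk (s)
            }
        )
    return chunks, metadata
```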

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File | Description
tests/test_transcribe.py | Updated expected segment count from 7 to 6 to reflect the reduced output due to silence removal.
faster_whisper/vad.py | Modified collect_chunks API to use a max_duration parameter and updated metadata keys; removed merge_segments.
faster_whisper/transcribe.py | Adjusted transcription logic to use new "offset" and "duration" metadata and removed merge_segments call.
Comments suppressed due to low confidence (2)

faster_whisper/transcribe.py:127

  • The transition from 'start_time' to 'offset' and using 'duration' improves consistency in chunk metadata; please ensure that all downstream time-based calculations have been updated accordingly.
duration = chunk_metadata["duration"]

faster_whisper/transcribe.py:411

  • The removal of the merge_segments call changes the segmentation behavior; confirm that directly using get_speech_timestamps produces the intended speech chunk boundaries.
clip_timestamps = get_speech_timestamps(audio, vad_parameters)

@MahmoudAshraf97
Collaborator Author

@Purfview I'd appreciate your review and testing

@Jiltseb
Contributor

Jiltseb commented May 12, 2025

@MahmoudAshraf97 Indeed, it can reduce the hallucinations. However, if each voiced region returned by the VAD is treated as a single segment and padded to 30 seconds for Whisper, we are effectively increasing the number of such segments and not packing multiple segments into a single GPU input. This can ruin efficiency.

We can adjust the merging logic to ensure chunks are filled with as many voiced regions as possible without exceeding the 30-second limit. We will have to maintain a mapping of the original segment indices and their timings for later retrieval. This way, we make sure that multiple segments are kept in a single 30-second input to the Whisper encoder while still avoiding the current hallucinations.
@MahmoudAshraf97 Thoughts?

@MahmoudAshraf97
Collaborator Author

Hello @Jiltseb, this is already the case: we pack regions until the total duration reaches 30 s, and after inference we map the predicted timings back to the original ones.
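
A minimal sketch of that mapping step, assuming each packed chunk keeps, for every speech region it contains, the region's position inside the condensed chunk and its original start time; the names below are illustrative, not the actual faster_whisper code.

```python
def restore_timestamp(t_condensed, regions):
    """
    t_condensed: time (s) predicted by the model, relative to the condensed chunk.
    regions: per-region mapping info, each a dict with
        "condensed_start": where the region begins inside the condensed chunk (s)
        "original_start":  where the region begins in the original audio (s)
        "duration":        length of the region (s)
    Returns the corresponding time in the original audio.
    """
    for region in regions:
        start = region["condensed_start"]
        if start <= t_condensed <= start + region["duration"]:
            return region["original_start"] + (t_condensed - start)
    # If the prediction overshoots the packed speech, clamp to the last region's end.
    last = regions[-1]
    return last["original_start"] + last["duration"]


if __name__ == "__main__":
    # Speech at 1-3 s and 9-10 s packed into one 3 s condensed chunk:
    regions = [
        {"condensed_start": 0.0, "original_start": 1.0, "duration": 2.0},
        {"condensed_start": 2.0, "original_start": 9.0, "duration": 1.0},
    ]
    print(restore_timestamp(2.5, regions))  # -> 9.5
```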

Jiltseb (Contributor) left a comment

👍 LGTM!
