Use Silero VAD in Batched Mode #936
When I used the batch version, I got better transcription results compared to the sequential version. I'm not sure if this is due to pyannote VAD or if there is an additional process in the batch version that improves WER. Have you ever compared Silero VAD with pyannote VAD? By the way, thank you for your contribution to improving faster-whisper. Even though it was a well-discussed and approved PR, anyone is entitled to have their opinion about it, but no one has the right to be rude.
It is indeed possible to get better results for long-form transcription in batched mode. This is because no context is passed between batches, which prevents ambiguous text from the previous context from carrying over into the next set of frames. Thanks for your kind words regarding the batched PR. @MahmoudAshraf97 I would suggest adding the numbers with pyannote VAD and Silero VAD (WER and the speed-up) for completeness.
The pyannote model could be the superior VAD, but the extra dependency on pyannote and torch is a concern at the moment.
@zh-plus it can be an option of course, but keeping pyannote will force us to keep PyTorch in the requirements, which we are trying to remove based on user feedback. I'm trying to think of a structure that makes the whole batching feature optional, with optional dependencies for those who want it.
Performance numbers added. Tests are passing locally but are failing on CI because torchaudio can't find a backend to use, since they are not installed after the removal of
Thanks for the PR! Could you add the script that exports the Silero V5 model to encoder and decoder ONNX files? Also, why does splitting the model into two ONNX sessions improve performance?
min_speech_duration_ms: int = 250
onset: float = 0.5
offset: float = onset - 0.15
min_speech_duration_ms: int = 0
Can you maybe leave these options (threshold, onset, offset) as they were, i.e. not rename them, as renaming would break the signature and parameter-passing APIs?
Why are you changing min_speech_duration_ms to 0? I think 250 ms is a sane default; otherwise you may end up with segments that are too short to contain speech, maybe even empty ones.
It's best to give users the freedom to tune the parameters as they wish. Previously `offset` was fixed to `threshold - 0.15`, but now users can tune it as they wish without having to touch the code internals. It might not be backwards compatible, but adapting to it is a very minimal change.
As for `min_speech_duration_ms`, benchmarks (YT Commons and LibriSpeech) showed that dropping it from 250 to 0 had a minimal positive effect or no effect on sequential inference, but a very positive impact on batched inference, as it combines segments differently than the sequential version.
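For readers unfamiliar with how an onset/offset pair is typically used, here is a minimal sketch of hysteresis thresholding over per-frame speech probabilities. It is an illustration only and not the code in this PR: the function name `speech_regions`, the default values, and the sample probabilities are all assumptions.

```python
def speech_regions(probs, onset=0.5, offset=0.35):
    """Return (start, end) frame-index pairs; end is exclusive.

    Hypothetical helper: speech starts when the probability reaches
    `onset` and ends only when it drops below the lower `offset`.
    """
    regions = []
    start = None  # frame index where the current region began, if any
    for i, p in enumerate(probs):
        if start is None and p >= onset:
            start = i  # speech begins
        elif start is not None and p < offset:
            regions.append((start, i))  # speech ends
            start = None
    if start is not None:
        regions.append((start, len(probs)))  # still in speech at the end
    return regions


# Made-up probabilities for illustration.
probs = [0.1, 0.6, 0.45, 0.4, 0.2, 0.7, 0.9]
print(speech_regions(probs))                # offset below onset
print(speech_regions(probs, offset=0.5))    # offset equal to onset
```

With a lower `offset`, brief dips in probability do not end a region early, which is why exposing it as a separate parameter can change how segments are formed.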
As for the reason: Silero models in general require the output of the previous sample to give a correct output for the next sample, but that dependency on the previous sample only exists in the decoder stage, which makes up a small part of the total computation cost. So by splitting the model into an encoder and a decoder and then batching the input to the encoder only, we gain a 3x speedup while still generating identical outputs.
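The idea can be sketched with a toy model. This is not the Silero architecture or the ONNX export from this PR; the functions `encode`, `encode_batch`, and `decode_step` and their arithmetic are invented stand-ins that only mirror the dependency structure described above: a stateless encoder that can run on all frames at once, and a stateful decoder that must run in order.

```python
import math


def encode(frame):
    """Stateless toy encoder: depends only on the current frame."""
    return math.tanh(sum(frame) / len(frame))


def encode_batch(frames):
    """Stand-in for a single batched encoder call over all frames."""
    return [encode(frame) for frame in frames]


def decode_step(feature, state):
    """Stateful toy decoder: needs the state left by the previous frame."""
    new_state = math.tanh(0.5 * feature + 0.8 * state)
    prob = 1.0 / (1.0 + math.exp(-4.0 * new_state))
    return prob, new_state


def run_sequential(frames):
    """Encoder and decoder both run frame by frame."""
    state, probs = 0.0, []
    for frame in frames:
        prob, state = decode_step(encode(frame), state)
        probs.append(prob)
    return probs


def run_split(frames):
    """Encoder runs once over the whole batch; only the decoder loops."""
    features = encode_batch(frames)
    state, probs = 0.0, []
    for feature in features:
        prob, state = decode_step(feature, state)
        probs.append(prob)
    return probs


frames = [[0.1, -0.2, 0.3], [0.9, 0.8, 0.7], [-0.5, -0.4, -0.6], [0.0, 0.2, 0.1]]
print(run_sequential(frames) == run_split(frames))
```

Because the encoder never sees the decoder state, moving it out of the loop does not change the outputs; any speedup comes from running the expensive part as one batched call.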
Thanks. Out of curiosity, did you find those reference implementations elsewhere, did you rewrite them based on the JIT'ted models, or is there a way to generate them automatically from JIT'ted models? PS: OK I think you can get the compiled graph from
I reimplemented it from scratch based on what I could understand from the JITed code and mapped the weights manually using the dictionary. Both implementations are within 1e-5 tolerance of the original implementation.
Reverted back to PyAV in #961. Once it is merged and then this one is merged, we can get rid of the torch dependency.
Nice. I have also re-implemented a NumPy version to get rid of the torch dependency, but I will stick to this plan of removing torch in two steps. I will test for the memory leak and report in #961.
Encountered another error for audio without speech. Not the same one as in #973. Can we just return an empty list in
Should be fixed now.
@hobodrifterdavid can you upload audio files that reproduce the two exceptions?
I don't have the clips on hand. I just added a check to make sure the audio clips I am sending are at least 5s long (it's possible I was requesting transcription of some zero-length files), and I will improve the logging to record what is being processed when an error occurs. I will let you know if I see the error again. If the passed audio data has zero length, it might be wise to throw a specific error up-front ('Passed audio is zero samples long' etc.), if you don't already.
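An up-front check of the kind suggested here could look like the sketch below. The names `EmptyAudioError` and `check_audio_not_empty` are hypothetical and are not part of faster-whisper; returning an empty result instead of raising, as proposed earlier in the thread, is an equally valid design.

```python
class EmptyAudioError(ValueError):
    """Hypothetical error for audio that contains no samples."""


def check_audio_not_empty(audio):
    """Raise early if `audio` (any sized sequence of samples) is empty."""
    if len(audio) == 0:
        raise EmptyAudioError("Passed audio is zero samples long")
    return audio


try:
    check_audio_not_empty([])
except EmptyAudioError as exc:
    print(f"rejected: {exc}")
```

Failing early with a specific message makes zero-length inputs easy to tell apart from errors raised deeper in the VAD or decoding code.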
* add onnx files to manifest
* change `merge_segments` to use audio indices directly
64852b5 to 8011470
Added some minor comments. I have tested Silero on the batched version and got similar WER, but the speed is 60% slower compared to the previous VAD. This is on a test set of 9 YouTube videos with various audio types and lengths from 3 to 13 minutes. With Silero, it is still at least 2x faster than the sequential version. With pyannote VAD it was 3.8x faster.
Have you seen this speed difference?
faster_whisper/vad.py
def merge_segments(segments_list, vad_options: VadOptions):
    curr_end = 0
    seg_idxs = []
    merged_segments = []
    edge_padding = vad_options.speech_pad_ms / 1000
    chunk_length = vad_options.max_speech_duration_s
    sampling_rate = 16000
Use `sampling_rate` as a function argument that defaults to `16000`. Avoid hard-coding the sampling rate and similar audio-related variables.
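As a small illustration of this suggestion, the sketch below takes the sampling rate as a keyword argument with a default instead of fixing it inside the function body. The helper `seconds_to_samples` is hypothetical and does not appear in the PR.

```python
def seconds_to_samples(seconds, sampling_rate=16000):
    """Convert a duration in seconds to a sample count.

    `sampling_rate` defaults to 16000 Hz but can be overridden by callers.
    """
    return int(round(seconds * sampling_rate))


print(seconds_to_samples(0.4))                       # default rate
print(seconds_to_samples(0.4, sampling_rate=8000))   # caller-supplied rate
```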
400ms edge padding can contain multiple syllables if the start and previous end times are closer (let's say 100ms). Any reason for keeping it 400ms instead of 100ms?
If the distance between two segments is less than `2 * edge_padding`, they are merged together, so it's guaranteed that no audio is included twice. I found that increasing or decreasing the padding value didn't make much difference, so I left it as is to allow for a higher error margin.
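The merging rule described above can be sketched as follows. This is a simplified illustration, not the `merge_segments` implementation from the PR: the function name `pad_and_merge`, the tuple-based segment format, and the sample values are assumptions, and the real function also has to respect the maximum chunk length.

```python
def pad_and_merge(segments, edge_padding=0.4):
    """Merge close segments, then pad the result.

    `segments` is assumed to be a sorted list of non-overlapping
    (start, end) pairs in seconds.
    """
    merged = []
    for start, end in segments:
        if merged and start - merged[-1][1] < 2 * edge_padding:
            # Gap is too small for both paddings: extend the last segment.
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    # Every remaining gap is at least 2 * edge_padding wide, so padded
    # segments can touch but never overlap.
    return [(max(0.0, s - edge_padding), e + edge_padding) for s, e in merged]


segments = [(1.0, 2.0), (2.5, 3.0), (5.0, 6.0)]
padded = pad_and_merge(segments)
print(padded)
```

In this example the first two segments are 0.5 s apart, which is less than 0.8 s, so they become one segment before padding is applied.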
As for the speedups, I found both implementations to be almost identical, or within measurement error. My specs are:
i7 12700k
RTX 3070 Ti
32GB Ram
Even if the Silero implementation is slightly slower, it's worth it because of the simpler requirements and the increased code reuse.
Makes sense for the `edge_padding`, and I agree that Silero makes the codebase lean and easy to maintain. Do you have the audio file you tested?
I tested on the YouTube Commons dataset:
pyannote vad:
Evaluating...: 94it [25:32, 16.31s/it]
WER: 13.976
Silero Vad:
Evaluating...: 94it [26:22, 16.83s/it]
WER: 13.756
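For reference, the WER figures quoted here are percentages of word-level edits relative to the reference length. The benchmark script is not shown in the thread and likely applies text normalization first, so the sketch below only illustrates the metric itself; the function `wer` and the example sentences are assumptions.

```python
def wer(reference, hypothesis):
    """Word error rate in percent, via word-level edit distance.

    Assumes a non-empty reference; words are split on whitespace.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the ref words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 3))
```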
Add `sampling_rate` as an argument in the `merge_segments` function as well and remove the hard-coded sampling rate (L318).
This PR tries to close the gap between the Batched and Sequential versions.
Summary of Changes:
`transcribe` function as much as I could
WER Comparisons
Batched (`without_timestamps=True`, `vad_filter=True`, `chunk_length=25`) on YouTube Commons using `distil-large-v3`:
Before: WER: 13.910
After: WER: 13.712
VAD parameters are not completely tuned, but I don't have the resources to evaluate on multilingual datasets. Contributions are welcome.