1.0.3 VAD v5 is much worse than 1.0.2 VAD v4 #934

Open
zx3777 opened this issue Jul 26, 2024 · 11 comments

Comments

@zx3777

zx3777 commented Jul 26, 2024

silero-vad

Large portions of the speech are missing.

For some files, the subtitle file is about 10 KB with version 1.0.2 but less than 1 KB with version 1.0.3.

This video file
https://www.youtube.com/watch?v=tVLOBfzbJV8
resulted in 320 lines of subtitles using version 1.0.2, but only 218 lines using version 1.0.3. Many conversations were not recognized in version 1.0.3.

I only compared Korean; other languages have not been tested yet.

@zx3777
Author

zx3777 commented Jul 26, 2024

This is the audio file for the video above.
https://mega.nz/file/QacS2LCJ#x_Gq9GgV8aPk2qRVskfzNBuyM9XAI-Pv2SBIwxfomnk

zx3777 changed the title from "1.0.3 VAD v5 is much worse than 1.0.2 VAD v4 in korean" to "1.0.3 VAD v5 is much worse than 1.0.2 VAD v4" on Jul 26, 2024
@x86Gr

x86Gr commented Jul 26, 2024

I agree. I also see worse performance, just not as much; however, the overall WER for non-English speech is going down. Go back to silero or at least let us choose the VAD model.

@zx3777
Author

zx3777 commented Jul 27, 2024

> I agree, I also have worse performance, just not as much, however the overall WER for non english speech is going down. Go back to silero or at least let us choose the VAD model

The 1.0.3 release still uses silero, just an upgraded version of it.
WER may be going down because the VAD only picks up sufficiently clear speech.

@MahmoudAshraf97
Collaborator

@zx3777 that would cause a higher WER, not a lower one: a missing word still counts as an error.
You should try adjusting the VAD settings and see what difference it makes. The model was changed, but the parameters are still tuned for the previous one.
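
To illustrate the point about missing words, here is a minimal sketch of a word-level WER computation. It is a generic edit-distance implementation, not the scoring code used by anyone in this thread, and the example sentences are made up. Words dropped because the VAD skipped a region show up as deletions and raise the score.

```python
from typing import List


def wer(reference: List[str], hypothesis: List[str]) -> float:
    """Word error rate: (substitutions + deletions + insertions) / len(reference).

    Assumes a non-empty reference.
    """
    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion (word missing from hypothesis)
                            curr[j - 1] + 1,      # insertion (extra word in hypothesis)
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(reference)


# Made-up example: the second hypothesis lost the middle of the sentence,
# as would happen if the VAD discarded that stretch of audio.
reference = "hello how are you today".split()
full = "hello how are you today".split()
dropped = "hello today".split()

print(wer(reference, full))     # 0.0
print(wer(reference, dropped))  # 0.6
```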

@zx3777
Author

zx3777 commented Jul 27, 2024

> @zx3777 that will cause higher WER, a missing word is still an error to count
> You should try playing with the vad settings and see how it makes a difference, the model was changed but the parameters are still tuned for the previous one

Useless

I tried --vad_threshold values of 0.4, 0.3, and 0.2 in 1.0.3. There was a slight improvement, but far fewer subtitles are recognized than in 1.0.2.
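
A toy sketch of why lowering the threshold can help only a little. This is not Silero's actual algorithm: it ignores minimum durations, padding, and hysteresis. The frame size and the per-frame probabilities are made up for illustration. The idea is that a lower threshold recovers frames the model scores as borderline, but frames scored near zero stay dropped at any reasonable threshold.

```python
from typing import List, Tuple

FRAME_SAMPLES = 512  # assumed frame size, for illustration only


def segments_above(probs: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Return (start, end) sample offsets of runs of frames with prob >= threshold."""
    segments = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # a speech run begins
        elif p < threshold and start is not None:
            segments.append((start * FRAME_SAMPLES, i * FRAME_SAMPLES))
            start = None  # the run ended on the previous frame
    if start is not None:
        segments.append((start * FRAME_SAMPLES, len(probs) * FRAME_SAMPLES))
    return segments


# Made-up speech probabilities: a confident region, then a region the model
# scores low, then frames it scores near zero.
probs = [0.05, 0.9, 0.95, 0.8, 0.1, 0.1, 0.35, 0.3, 0.25, 0.05]

for threshold in (0.5, 0.4, 0.3, 0.2):
    print(threshold, segments_above(probs, threshold))
```

With these numbers, 0.5 and 0.4 give the same single segment, 0.3 adds a second one, and 0.2 extends it by one frame, while the 0.05 and 0.1 frames are never picked up.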

@hoonlight
Contributor

hoonlight commented Jul 28, 2024

Hi, could you try again with the master branch and let me know the results?

@x86Gr

x86Gr commented Jul 29, 2024

I will run the tests on our audio corpora with different parameters, but it won't be quick.

@zx3777
Author

zx3777 commented Aug 12, 2024

> Hi, could you try again with the master branch and let me know the results?

I tested the master branch as it was before the upgrade to [New PR for Faster Whisper: Batching Support, Speed Boosts, and Quality Enhancements], and the results were the same.

In my opinion, after the new PR, only the batched version uses a different VAD implementation. The normal version still uses the VAD from 1.0.3, so the results should be the same.

@hoonlight
Contributor

hoonlight commented Aug 13, 2024

Thanks for the test @zx3777, I suspect this is an issue with the model itself.
There hasn't been enough quantitative evaluation of silero-vad v5, but at least we can make it possible for users to choose silero-vad v4 instead of silero-vad v5 based on their needs.

I'll open a PR once the issues related to this discussion are settled.

@MahmoudAshraf97
Collaborator

> Thanks for the test @zx3777 , I suspect this is a issue with the model itself. There hasn't been enough quantitative evaluation of the silero-vad v5, but at least we can make it possible for users to choose silero-vad v4 instead of silero-vad v5 based on their needs.
>
> I'll open a PR after the issues related to this discussion are well finalized.

I already wrote the code, but I'm waiting for #936 to be merged so we can discuss keeping both or just reverting to v4.

@George0828Zhang

George0828Zhang commented Oct 26, 2024

Just chiming in to add a case where the old version (not sure if it's v3 or v4) outperforms v5:
https://drive.google.com/file/d/1NPvEybP0VU1dFmd6neH6JJRW_Qm2MXdk/view?usp=sharing

code:

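from pprint import pprint

from faster_whisper.audio import decode_audio
from faster_whisper.vad import get_speech_timestamps

# Decode the file to mono float32 samples at decode_audio's default sampling rate.
audio = decode_audio('ja_example.wav')
# Run the VAD with default options; each chunk holds 'start' and 'end' sample offsets.
speech_chunks = get_speech_timestamps(audio)
pprint(speech_chunks)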

old:

[{'end': 40192, 'start': 12032},
 {'end': 179456, 'start': 76544},
 {'end': 379136, 'start': 273152},
 {'end': 457984, 'start': 422656},
 {'end': 630016, 'start': 576256},
 {'end': 669952, 'start': 653056},
 {'end': 863488, 'start': 695040},
 {'end': 950528, 'start': 896768}]

v5:

[{'end': 30464, 'start': 12032}]

Apparently cartoony voices are ignored in v5.
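
To put the two outputs above in perspective, here is a small sketch that converts the sample offsets to seconds of detected speech. It assumes the audio was decoded at 16 kHz, which I believe is the decode_audio default; if the file was decoded at another rate, the seconds change but the ratio does not.

```python
SAMPLING_RATE = 16000  # assumption: decode_audio's default sampling rate

# Chunks copied from the "old" and "v5" outputs above.
old = [
    {"start": 12032, "end": 40192},
    {"start": 76544, "end": 179456},
    {"start": 273152, "end": 379136},
    {"start": 422656, "end": 457984},
    {"start": 576256, "end": 630016},
    {"start": 653056, "end": 669952},
    {"start": 695040, "end": 863488},
    {"start": 896768, "end": 950528},
]
v5 = [{"start": 12032, "end": 30464}]


def speech_seconds(chunks, sampling_rate=SAMPLING_RATE):
    """Total duration of the detected speech chunks, in seconds."""
    return sum(c["end"] - c["start"] for c in chunks) / sampling_rate


old_s = speech_seconds(old)
v5_s = speech_seconds(v5)
print(f"old: {old_s:.3f} s, v5: {v5_s:.3f} s, ratio: {v5_s / old_s:.1%}")
# old: 35.328 s, v5: 1.152 s, ratio: 3.3%
```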
