What is the performance of WhisperSpeech? #81
Comments
Benchmarking different TTS models is challenging, since we don't really have an objective metric for audio quality. What we can do is provide samples so the different models can be compared by ear, but I'm not sure that is what you are looking for.
On my RTX 4090 I did some basic tests in terms of memory usage, and the quality was about the same as Bark, so maybe that'll help a little. If memory serves, the "tiny" WhisperSpeech model was a little faster than even the smallest Bark model, but overall the two were very comparable in how accurately they speak the words, and the voices themselves are so close to Bark's that it's hard to tell them apart. So if WhisperSpeech continues to progress like I think it will, I see it surpassing Bark and the other options out there. Also, I tested Coqui and a few others over the weekend and all of their voices are inferior, so I see the best open source ones being Bark and WhisperSpeech. Not referring to proprietary ones of course (Hey Siri!). I tried multiple models and voices with this and none produced quality as high as Bark or WhisperSpeech. Many were much, much faster, but again, you'll get an electronic, computer-sounding voice.
I was looking at two aspects mainly:
It looks like the only way is to compare voice samples and then judge which one is better. Even the standard metric for TTS, MOS (mean opinion score), is a score that humans assign based on audio quality.
Manual listening tests with MOS seem to be the only reliable metric right now. It could be an interesting community project to build a leaderboard for TTS models with crowdsourced scoring.
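For what it's worth, the scoring side of such a leaderboard is simple. Here is a minimal sketch, assuming each listener rates a clip on the usual 1 to 5 scale and that the ratings are already collected per model. The function names and the example data are made up for illustration, and the 95% interval uses a plain normal approximation, which is rough for small panels.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mos, ci95_half_width) for a list of 1-5 listener ratings."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate spread")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mos = statistics.mean(ratings)
    # Normal-approximation 95% confidence interval (an assumption, not a standard).
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


def leaderboard(ratings_by_model):
    """Rank models by MOS, highest first. Input: {model_name: [ratings]}."""
    rows = [(name, *mean_opinion_score(r)) for name, r in ratings_by_model.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical ratings, not real measurements of any model.
    example = {"model_a": [4, 5, 4, 3, 4], "model_b": [3, 3, 4, 2, 3]}
    for name, mos, ci in leaderboard(example):
        print(f"{name}: MOS {mos:.2f} +/- {ci:.2f}")
```

Overlapping intervals between two models would suggest the panel is too small to separate them.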
Yeah, and it'd be hard though, because audio is much more subjective. The voice cloning seems subjective to a certain extent, but I suppose you could try to prove it by examining the spectrograms or wavegrams of the audio to see if they have similar structures. Still, I think it's partially subjective. Best approach IMHO: run a simple survey asking people about the voice cloning aspect and about the quality of non-cloned voices. Speed should be measurable as long as apples-to-apples comparisons are done (e.g. using the same beam size, quantization level, etc.).
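On the speed side, one common way to compare models is the real-time factor: synthesis time divided by the duration of the audio produced, with values below 1 meaning faster than real time. A minimal sketch, assuming each model is wrapped in a callable `synthesize(text)` that returns a sequence of audio samples. That wrapper, the function name, and the parameters are hypothetical and not part of any library's API.

```python
import time


def real_time_factor(synthesize, text, sample_rate, runs=3, warmup=1):
    """Best-of-N synthesis time divided by the duration of the generated audio.

    `synthesize` is a hypothetical wrapper: it takes text and returns a
    sequence of samples. For GPU models it should block until the audio is
    actually ready, otherwise the timing only measures kernel launch.
    """
    for _ in range(warmup):
        synthesize(text)  # exclude one-off costs such as model loading or caching
    best = float("inf")
    n_samples = 0
    for _ in range(runs):
        start = time.perf_counter()
        audio = synthesize(text)
        best = min(best, time.perf_counter() - start)
        n_samples = len(audio)
    if n_samples == 0:
        raise ValueError("synthesize returned no audio")
    return best / (n_samples / sample_rate)


if __name__ == "__main__":
    # Stand-in "model" that returns one second of silence at 24 kHz.
    def fake_tts(text):
        return [0.0] * 24000

    print(real_time_factor(fake_tts, "hello", sample_rate=24000))
```

Using the same text, hardware, and settings for every model is what keeps the comparison fair.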
I am trying to understand the performance of WhisperSpeech in TTS and voice cloning.
Are there any results available, such as benchmarks or a paper, that compare the performance of the WhisperSpeech project with other projects like OpenVoice and Spear-TTS?
Thanks for creating this awesome library. I really liked this project compared to OpenVoice in my initial analysis :)