What is the performance of WhisperSpeech? #81

Open
kurianbenoy-sentient opened this issue Feb 5, 2024 · 6 comments

Comments

@kurianbenoy-sentient

I was trying to understand the performance of WhisperSpeech for TTS and voice cloning.

Are there any results available, such as benchmarks or a paper, that compare the performance of WhisperSpeech with other projects like OpenVoice and Spear-TTS?

Thanks for creating this awesome library. I really liked this project compared to OpenVoice in my initial analysis :)

@zoq
Contributor

zoq commented Feb 5, 2024

Benchmarking different TTS models is challenging, since we don't really have an objective metric for audio quality. What we can do is provide samples for comparing the different models, but I'm not sure that is what you are looking for?

@BBC-Esq
Contributor

BBC-Esq commented Feb 5, 2024

On my RTX 4090 I did some basic tests in terms of memory usage, and the quality was about the same as Bark, so maybe that'll help a little.

#68 (comment)

If memory serves, the "tiny" WhisperSpeech model was a little faster than even the smallest Bark model, but overall they were very comparable in how accurately they speak the input text, and the voices themselves are so close to Bark's that it's hard to tell them apart. So if WhisperSpeech continues to progress like I think it will, I see it surpassing Bark and the other options out there.

Also, I tested Coqui and a few others over the weekend and all of their voices are inferior so...I see the best open source ones being Bark and WhisperSpeech. Not referring to proprietary ones of course (Hey Siri!).

I tried multiple models and voices with Coqui (link below) and none produced quality as high as Bark or WhisperSpeech, but many were much, much, much faster...but again, you'll get an electronic-sounding, computer-sounding...etc. voice.

https://github.com/coqui-ai/TTS

@kurianbenoy-sentient
Author

I was looking at two aspects mainly:

  1. Overall TTS quality (compared to Coqui-ai TTS, Bark AI, OpenVoice, etc.)
  2. Voice cloning quality (OpenVoice compared to this project)

It looks like the only way is to compare voice samples and then judge which one is better. Actually, the usual metric for TTS, MOS (mean opinion score), is also a score that humans assign based on audio quality.

@jpc
Contributor

jpc commented Feb 6, 2024

Manual listening tests with MOS seem to be the only reliable metric right now. It could be an interesting community project to make a leaderboard for TTS models with crowdsourced scoring.
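For anyone who wants to try the leaderboard idea, here is a minimal sketch of how crowdsourced ratings might be aggregated into a MOS per model. It assumes the common 1-5 rating scale and a normal-approximation 95% confidence interval; the function names `mean_opinion_score` and `leaderboard` are made up for illustration and are not part of WhisperSpeech or any other library.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mos, ci95_half_width) for a list of 1-5 listener ratings.

    Hypothetical helper: assumes every rating is on the 1-5 scale.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate spread")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mos = statistics.mean(ratings)
    # Normal-approximation 95% interval; crude for small listener panels.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


def leaderboard(ratings_by_model):
    """Return (model, mos, ci95_half_width) rows sorted by MOS, best first."""
    rows = [
        (name, *mean_opinion_score(ratings))
        for name, ratings in ratings_by_model.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Calling `leaderboard` with a dict that maps each model name to its list of listener ratings gives rows sorted best first. Reporting the interval next to the score matters, because two models whose intervals overlap can't really be ranked from that sample.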

@BBC-Esq
Contributor

BBC-Esq commented Feb 6, 2024

Yeah, and it'd be hard though because audio is much more subjective...

The voice cloning seems subjective to a certain extent, but I suppose you could try to prove it by examining the spectrograms or wavegrams of the audio to see if they have similar structures...But still I think it's partially subjective.

The best approach, IMHO, is a simple survey asking people about both the voice cloning and the quality of the non-cloned voices. Speed should be measurable as long as apples-to-apples comparisons are done (e.g. using the same beam size, quantization level, etc.).
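As a rough sketch of what an apples-to-apples speed test could look like: time each model on the same text, with the same number of warmup and measured runs, and report the real-time factor (processing time divided by the length of the generated audio). The `synthesize` callable here is a hypothetical wrapper you would write per model; it is assumed to run synthesis to completion (including any GPU synchronization) and return the audio duration in seconds.

```python
import statistics
import time


def benchmark(synthesize, text, runs=5, warmup=1):
    """Time a synthesize(text) callable and report its real-time factor.

    `synthesize` is a hypothetical per-model wrapper that must return the
    duration of the generated audio in seconds.
    """
    # Warmup runs absorb one-off costs such as model loading or compilation.
    for _ in range(warmup):
        synthesize(text)

    timings = []
    audio_seconds = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        audio_seconds = synthesize(text)
        timings.append(time.perf_counter() - start)

    if audio_seconds <= 0:
        raise ValueError("synthesize must return a positive audio duration")

    median = statistics.median(timings)
    return {
        "median_seconds": median,
        # Below 1.0 means faster than real time.
        "real_time_factor": median / audio_seconds,
    }
```

Keeping the text, hardware, precision, and run counts identical across models is what makes the resulting numbers comparable; the median is used so a single slow run doesn't skew the result.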

@BBC-Esq
Contributor

BBC-Esq commented Apr 13, 2024

Congratulations on the new fast-small t2s model. Here are the updated benchmarks!

[Two benchmark images]
