What is the performance of WhisperSpeech? #81

kurianbenoy-sentient · 2024-02-05T13:23:57Z

I am trying to understand how WhisperSpeech performs at TTS and voice cloning.

Are there any results available, such as benchmarks or a paper, that compare the performance of WhisperSpeech with other projects like OpenVoice and Spear-TTS?

Thanks for creating this awesome library. I really liked this project compared to OpenVoice in my initial analysis :)

zoq · 2024-02-05T17:54:01Z

Benchmarking different TTS models is challenging, since we don't really have a metric to measure audio quality. What we can do is provide samples for comparing the different models, but I'm not sure that is what you are looking for?

BBC-Esq · 2024-02-05T19:35:39Z

On my RTX 4090 I did some basic tests in terms of memory usage, and the quality was about the same as Bark, so maybe that'll help a little.

#68 (comment)

If memory serves, the "tiny" WhisperSpeech model was a little faster than even the smallest Bark model, but overall they were very comparable in how well they speak the input words, and the voices themselves are so close to Bark's that it's hard to tell them apart. So if WhisperSpeech continues to progress like I think it will, I see it surpassing Bark and the other options out there.

Also, I tested Coqui and a few others over the weekend and all of their voices are inferior, so...I see the best open source ones being Bark and WhisperSpeech. Not referring to proprietary ones of course (Hey Siri!).

I tried multiple Coqui models and voices (link below) and none produced quality as high as Bark or WhisperSpeech. Many were much, much, much faster...but again, you'll get an electronic-sounding, computer-sounding, etc. voice.

https://github.com/coqui-ai/TTS

kurianbenoy-sentient · 2024-02-06T12:47:50Z

I was looking at two aspects mainly:

Overall TTS quality (compared to Coqui-ai TTS, Bark AI, OpenVoice, etc.)
Voice cloning quality (compared to OpenVoice and this project)

It looks like the only way is to compare voice samples and then judge which one is better. In fact, the usual metric for TTS, MOS, is itself a score that humans assign based on audio quality.

jpc · 2024-02-06T12:54:09Z

Manual listening tests with MOS seem to be the only reliable metric right now. It could be an interesting community project to make a leaderboard for TTS models with crowdsourced scoring.
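As a rough illustration of the leaderboard idea, here is a minimal sketch of how crowdsourced ratings could be aggregated into a MOS per model. Everything in it is an assumption for illustration: the model names and ratings are made up, the 1 to 5 scale is the conventional MOS scale rather than anything this project defines, and the 95% interval uses a normal approximation (for small listener counts a t-distribution would be more appropriate).

```python
import math
import statistics


def mos_summary(ratings):
    """Return (mean, 95% CI half-width) for a list of 1-5 listener ratings."""
    n = len(ratings)
    mean = statistics.fmean(ratings)
    if n < 2:
        # A single rating gives no estimate of spread.
        return mean, float("nan")
    # Normal approximation: 1.96 standard errors on either side of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(n)
    return mean, half_width


def leaderboard(scores):
    """Rank models by mean opinion score, highest first."""
    rows = [(name, *mos_summary(ratings)) for name, ratings in scores.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical ratings, not real measurements of any model.
example_scores = {
    "model_a": [4, 5, 4, 4, 3, 5],
    "model_b": [3, 3, 4, 2, 3, 4],
}

for name, mean, ci in leaderboard(example_scores):
    print(f"{name}: MOS {mean:.2f} +/- {ci:.2f}")
```

Reporting the interval next to the mean matters here: with only a handful of listeners, two models whose intervals overlap should probably not be ranked against each other.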

BBC-Esq · 2024-02-06T13:57:52Z

Yeah, though it'd be hard because audio is much more subjective...

The voice cloning seems subjective to a certain extent, but I suppose you could try to prove it by examining the spectrograms or wavegrams of the audio to see if they have similar structures...But still I think it's partially subjective.

Best approach IMHO: run a simple survey of people on the voice cloning aspect and on the quality of non-cloned voices. Speed should be measurable as long as apples-to-apples comparisons are done (e.g. using the same beam size, quantization level, etc.)
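For the speed side, one way to keep comparisons apples to apples is to run every model through the same timing harness and report a real-time factor (synthesis time divided by the duration of the audio produced). The sketch below is only an assumption about how that could look: `real_time_factor` and `fake_synthesize` are hypothetical names, and the fake synthesizer stands in for a real model call that returns a sequence of audio samples.

```python
import time


def real_time_factor(synthesize, text, sample_rate, warmup=1, runs=3):
    """Median wall-clock synthesis time divided by seconds of audio produced."""
    # Warm-up runs keep one-off costs (model loading, caching) out of the timing.
    for _ in range(warmup):
        synthesize(text)

    timings = []
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        samples = synthesize(text)
        timings.append(time.perf_counter() - start)

    audio_seconds = len(samples) / sample_rate
    timings.sort()
    median = timings[len(timings) // 2]
    return median / audio_seconds


def fake_synthesize(text):
    # Hypothetical stand-in for a TTS model: 0.1 s of silence per character at 16 kHz.
    return [0.0] * (1600 * len(text))


rtf = real_time_factor(fake_synthesize, "hello world", sample_rate=16000)
print(f"real-time factor: {rtf:.6f}")
```

A value below 1 means faster than real time. With a GPU model, the callable passed in would also need to wait for the device to finish before returning, otherwise the timing only covers the kernel launch; and the same text, hardware, and settings should be used for every model being compared.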

BBC-Esq · 2024-04-13T19:21:42Z

Congratulations on the new fast-small t2s model. Here's the updated benchmarks!
