Progress updates (from newest):
Progress update [2024-01-29]
We successfully trained a tiny S2A model on an en+pl+fr dataset and it can do voice cloning in French:

fr-voice-clone-2.mp4

fr-voice-clone-1.mp4
We were able to do this with frozen semantic tokens that were only trained on English and Polish. This supports the idea that we will be able to train a single semantic token model to support all the languages in the world. Quite likely even ones that are not currently well supported by the Whisper model. Stay tuned for more updates on this front. :)
Progress update [2024-01-18]
We spent the last week optimizing inference performance. We integrated torch.compile, added KV caching, and tuned some of the layers – we are now running over 12x faster than real-time on a consumer 4090!

We also added an easy way to test voice cloning. Here is a sample voice cloned from a famous speech by Winston Churchill:
en-cloning.mp4
We can also mix languages in a single sentence (here the highlighted English project names are seamlessly mixed into Polish speech):
pl-en-mix.mp4
You can test all of these on Colab. A Hugging Face Space is coming soon.
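The KV caching mentioned above is the standard trick behind these speedups: instead of recomputing attention over the entire generated prefix at every decoding step, the keys and values of past steps are stored and only the new token's query attends over them. A minimal, framework-free sketch of the idea (toy single-head attention; the class and names here are illustrative, not WhisperSpeech internals):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

class KVCache:
    """Toy single-head attention with a key/value cache: each new step
    appends its key and value, then attends over the whole cached prefix
    without recomputing anything for earlier positions."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Cache this step's key/value, then compute scaled dot-product
        # attention of the new query q over all cached keys.
        self.keys.append(k)
        self.values.append(v)
        scale = 1.0 / math.sqrt(len(q))
        weights = softmax([dot(q, key) * scale for key in self.keys])
        dim = len(v)
        return [sum(w * val[i] for w, val in zip(weights, self.values))
                for i in range(dim)]
```

With the cache, each decoding step costs O(prefix length) instead of recomputing the full O(prefix length²) attention from scratch, which is where most of the autoregressive-inference speedup comes from (torch.compile then fuses and specializes the remaining per-step work).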
Progress update [2024-01-10]
We’ve pushed a new SD S2A model that is a lot faster while still
generating high-quality speech. We’ve also added an example of voice
cloning based on a reference audio file.
As always, you can check out our
Colab
to try it yourself!
2023-12-10
Another trio of models, this time they support multiple languages (English and Polish). Here are two new samples for a sneak peek. You can check out our Colab to try it yourself!
English speech, female voice (transferred from a Polish language dataset):
whisperspeech-sample.mp4
A Polish sample, male voice:
whisperspeech-sample-pl.mp4
2023-07-14
We have trained a new pair of models, added support for multiple speakers and integrated the Vocos vocoder to deliver a big overall quality boost. And this is not even our last word because we are doing hyperparameter tuning to train bigger, higher-quality models.
An end-to-end generation example, inspired by one famous president's speech (don't forget to unmute the videos):
Female voice:
we-choose-tts.mp4
Male voice:
we-choose-tts-s467.mp4
We have streamlined the inference pipeline and you can now test the model yourself on Google Colab:
2023-04-13
We have trained a preliminary T->S model and a new 3kbps S->A model that improves the speech quality. Both models are far from perfect yet, but we are clearly moving in the right direction (to the moon 🚀🌖!).
End-to-end TTS model with ≈ 6% WER (both T->S and S->A sampled with simple multinomial sampling at T = 0.7, no beam search); see #9 for more details:
(don't forget to unmute the video)
test-e2e-jfk-T0.7.mp4
Ground truth:
we-choose.mp4
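The multinomial sampling used in these results is plain temperature sampling: divide the logits by T, softmax, and draw one token from the resulting distribution, with no beam search. A minimal sketch in pure Python (the function name and signature are illustrative, not the project's actual code):

```python
import math
import random

def sample_multinomial(logits, temperature=0.7, rng=random):
    """Temperature-scaled multinomial sampling: scale logits by 1/T,
    softmax into probabilities, then draw a single index.
    Lower T sharpens the distribution; as T -> 0 it approaches argmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    probs = [math.exp(s - m) for s in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    # Inverse-CDF draw: walk the cumulative distribution until we pass r.
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1  # guard against floating-point round-off
```

At T = 0.7 the distribution is sharpened relative to the raw softmax, trading a little diversity for fewer low-probability (and often unintelligible) tokens, which is a common default for speech-token decoding.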
2023-04-03
We have trained a working S->A model. It does not sound amazing, but that is mostly due to EnCodec quality at 1.5kbps.
Validation set ground truth (don't forget to unmute):
ground-truth.mov
The generated output from the S->A model (multinomial sampling, temperature 0.8):
saar-1300hr-2l-20e-T0.8.mov