Progress updates (from newest):
Progress update [2024-01-29]
We successfully trained a tiny S2A model on an en+pl+fr dataset and it can do voice cloning in French:

fr-voice-clone-2.mp4

fr-voice-clone-1.mp4
We were able to do this with frozen semantic tokens that were only trained on English and Polish. This supports the idea that we will be able to train a single semantic token model to support all the languages in the world. Quite likely even ones that are not currently well supported by the Whisper model. Stay tuned for more updates on this front. :)
Progress update [2024-01-18]
We spent the last week optimizing inference performance. We integrated torch.compile, added KV caching, and tuned some of the layers – we are now running over 12x faster than real-time on a consumer 4090!

We also added an easy way to test voice cloning. Here is a sample voice cloned from a famous speech by Winston Churchill:
en-cloning.mp4
We can also mix languages in a single sentence (here the highlighted English project names are seamlessly mixed into Polish speech):
pl-en-mix.mp4
You can test all of these on Colab. A Hugging Face Space is coming soon.
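The KV caching mentioned above is the standard trick behind these speedups: instead of recomputing attention over the entire generated prefix at every decoding step, the keys and values of past steps are stored and only the new token's query attends over them. A minimal, framework-free sketch of the idea (toy single-head attention; the class and names here are illustrative, not WhisperSpeech internals):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

class KVCache:
    """Toy single-head attention with a key/value cache: each new step
    appends its key and value, then attends over the whole cached prefix
    without recomputing anything for earlier positions."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Cache this step's key/value, then compute scaled dot-product
        # attention of the new query q over all cached keys.
        self.keys.append(k)
        self.values.append(v)
        scale = 1.0 / math.sqrt(len(q))
        weights = softmax([dot(q, key) * scale for key in self.keys])
        dim = len(v)
        return [sum(w * val[i] for w, val in zip(weights, self.values))
                for i in range(dim)]
```

With the cache, each decoding step costs O(prefix length) instead of recomputing the full O(prefix length²) attention from scratch, which is where most of the autoregressive-inference speedup comes from (torch.compile then fuses and specializes the remaining per-step work).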
Progress update [2024-01-10]
We’ve pushed a new SD S2A model that is a lot faster while still
generating high-quality speech. We’ve also added an example of voice
cloning based on a reference audio file.
As always, you can check out our
Colab
to try it yourself!
2023-12-10
Another trio of models, this time they support multiple languages (English and Polish). Here are two new samples for a sneak peek. You can check out our Colab to try it yourself!
English speech, female voice (transferred from a Polish language dataset):
whisperspeech-sample.mp4
A Polish sample, male voice:
whisperspeech-sample-pl.mp4
2023-07-14
We have trained a new pair of models, added support for multiple speakers and integrated the Vocos vocoder to deliver a big overall quality boost. And this is not even our last word because we are doing hyperparameter tuning to train bigger, higher-quality models.
An end-to-end generation example, inspired by one famous president's speech (don't forget to unmute the videos):
Female voice:
we-choose-tts.mp4
Male voice:
we-choose-tts-s467.mp4
We have streamlined the inference pipeline and you can now test the model yourself on Google Colab:
2023-04-13
We have trained a preliminary T->S model and a new 3kbps S->A model that improves the speech quality. Both models are far from perfect yet, but we are clearly moving in the right direction (to the moon 🚀🌖!).
End-to-end TTS model with ≈ 6% WER (both T->S and S->A sampled with simple multinomial sampling at T = 0.7, no beam search); see #9 for more details:
(don't forget to unmute the video)
test-e2e-jfk-T0.7.mp4
Ground truth:
we-choose.mp4
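The multinomial sampling used in these results is plain temperature sampling: divide the logits by T, softmax, and draw one token from the resulting distribution, with no beam search. A minimal sketch in pure Python (the function name and signature are illustrative, not the project's actual code):

```python
import math
import random

def sample_multinomial(logits, temperature=0.7, rng=random):
    """Temperature-scaled multinomial sampling: scale logits by 1/T,
    softmax into probabilities, then draw a single index.
    Lower T sharpens the distribution; as T -> 0 it approaches argmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    probs = [math.exp(s - m) for s in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    # Inverse-CDF draw: walk the cumulative distribution until we pass r.
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1  # guard against floating-point round-off
```

At T = 0.7 the distribution is sharpened relative to the raw softmax, trading a little diversity for fewer low-probability (and often unintelligible) tokens, which is a common default for speech-token decoding.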
2023-04-03
We have trained a working S->A model. It does not sound amazing, but that is mostly due to EnCodec quality at 1.5kbps.
Validation set ground truth (don't forget to unmute):
ground-truth.mov
The generated output from the S->A model (multinomial sampling, temperature 0.8):
saar-1300hr-2l-20e-T0.8.mov