Support Text to Speech #209

zolrath · 2023-05-09T20:00:34Z

Hello!
As Speech to Text models such as Whisper are added, having access to some of the impressive AI Text to Speech models would be a nice way to close the loop!

My current suggestion for a model to support would be Bark.

fredwu · 2023-05-11T07:50:14Z

+1

Would also love to see support for Coqui TTS.

bartekupartek · 2024-01-24T19:02:20Z

It would be great to run Bark in Elixir. Also, this TTS model has recently attracted a lot of attention: https://github.com/collabora/WhisperSpeech

Jdyn · 2024-03-18T20:59:27Z

I hate to reiterate what's already been said, but TTS in Bumblebee using Bark would be super valuable. Any chance of supporting it?

Hugging Face: https://huggingface.co/suno/bark

josevalim · 2024-03-18T21:01:03Z

Pull requests are always welcome. Starting with one of the models in Hugging Face Transformers is probably the easiest way to get started: https://huggingface.co/docs/transformers/en/tasks/text-to-speech

nickkaltner · 2024-03-29T04:07:53Z

Just adding this as another interesting model to support: https://huggingface.co/coqui/XTTS-v2

bartekupartek · 2024-04-05T22:24:02Z

I tried to port Bark and, later on, WhisperSpeech. They use multiple models to convert text to semantics, semantics to audio and encode... Anyway, some more promising models have been released recently:
https://huggingface.co/parler-tts/parler_tts_mini_v0.1
or https://github.com/jasonppy/VoiceCraft
or https://github.com/myshell-ai/OpenVoice
After reviewing their architectures, I think they might be easier to integrate.

michelson · 2024-04-08T04:32:09Z

@bartekupartek, do you have your implementation open? I'm trying to do the same. I've read the docs, but I'm not sure where to start.

bartekupartek · 2024-04-10T09:06:37Z

@michelson not yet, but I'm working on it. These models don't use standard layers, or where they do, they are in pickle format. I needed to step back and understand simpler models with Axon first.

bartekupartek · 2024-04-23T22:19:27Z

I'm currently playing around with Tacotron 2 text-to-speech. Since it's the simplest TTS model I've found, I'm trying to reproduce it in Elixir. I used nx_signal to process audio files and generate mel spectrograms, but during my research I noticed there is no support for a vocoder in the Elixir ecosystem to convert spectrograms back to audio. Or am I missing something?
Vocoders are typically separate models, so I think they could be integrated into Bumblebee. All the TTS models I found use vocoders to generate audio from their outputs, but they are yet another layer of complexity.
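
For readers unfamiliar with the mel step mentioned above, here is a minimal sketch in Python (standard library only) of how a power spectrum can be mapped onto mel bands. It assumes the common HTK-style mel formula and plain triangular filters. The function names and parameter values are hypothetical, chosen for illustration, and are not taken from nx_signal or Tacotron 2.

```python
import math

def hz_to_mel(hz):
    # HTK-style mel scale (an assumption; other variants exist).
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sample_rate, f_min=0.0, f_max=None):
    """Return n_mels rows of triangular weights over n_fft // 2 + 1 FFT bins."""
    if f_max is None:
        f_max = sample_rate / 2.0
    n_bins = n_fft // 2 + 1
    mel_lo, mel_hi = hz_to_mel(f_min), hz_to_mel(f_max)
    # n_mels + 2 edge frequencies, evenly spaced on the mel scale.
    edges_hz = [
        mel_to_hz(mel_lo + (mel_hi - mel_lo) * i / (n_mels + 1))
        for i in range(n_mels + 2)
    ]
    bin_hz = [k * sample_rate / n_fft for k in range(n_bins)]
    bank = []
    for m in range(1, n_mels + 1):
        left, center, right = edges_hz[m - 1], edges_hz[m], edges_hz[m + 1]
        row = []
        for f in bin_hz:
            rising = (f - left) / (center - left)
            falling = (right - f) / (right - center)
            # Triangle: the lower of the two slopes, clipped at zero.
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank

def apply_filterbank(bank, power_spectrum):
    # One mel-band energy per filter: weighted sum over the FFT bins.
    return [sum(w * p for w, p in zip(row, power_spectrum)) for row in bank]

if __name__ == "__main__":
    bank = mel_filterbank(n_mels=4, n_fft=16, sample_rate=8000)
    flat_spectrum = [1.0] * 9  # one frame with equal power in every bin
    print(apply_filterbank(bank, flat_spectrum))
```

The filterbank discards phase and merges neighbouring frequency bins, which is why a separate vocoder is needed to get from a mel spectrogram back to a waveform.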

josevalim · 2024-04-24T07:22:49Z

Correct. We would need to implement them in Elixir. Maybe @polvalente knows of an implementation that could be ported. Otherwise we need to check whether there are any JAX implementations. If not, maybe it needs to be a separate library we invoke.

polvalente · 2024-04-24T10:12:47Z

There are many kinds of vocoders. I think the best way to approach this would be to choose a specific model we want to support and work towards porting the one it uses.

bartekupartek · 2024-04-24T12:08:39Z

I was thinking it might be one of the torchaudio vocoders, such as Griffin-Lim (the output sounds robotic), WaveRNN (most likely this one) or Nvidia WaveGlow, to turn mel spectrograms into audio, but I just read through the VALL-E paper that Bark is based on:

> We propose VALL-E, the first TTS framework with strong in-context learning capabilities as GPT-3, which treats TTS as a language model task with audio codec codes as an intermediate representation to replace the traditional mel spectrogram

It would be fun to have Tacotron 2 working end to end, or to hear how mel spectrograms sound, but it looks like that doesn't make sense for any of the recent models mentioned above, which use facebook/encodec to turn outputs into audio codes directly 🙇‍♂️
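
To make "audio codec codes" a bit more concrete: EnCodec represents each audio frame with residual vector quantization (RVQ), where every stage quantizes what the previous stage left over. Below is a toy sketch of that idea in Python (standard library only). The codebooks are tiny hand-written examples and the function names are hypothetical. This is not EnCodec's actual code or API.

```python
def nearest(codebook, vec):
    # Index of the codebook entry with the smallest squared distance to vec.
    best_index, best_dist = 0, float("inf")
    for i, entry in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(entry, vec))
        if dist < best_dist:
            best_index, best_dist = i, dist
    return best_index

def rvq_encode(codebooks, vec):
    # Each stage quantizes the residual left by the previous stages.
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        codes.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return codes

def rvq_decode(codebooks, codes):
    # Reconstruction is the sum of the selected entry from every stage.
    out = [0.0] * len(codebooks[0][0])
    for codebook, i in zip(codebooks, codes):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out

if __name__ == "__main__":
    # Toy codebooks (assumed values): a coarse stage, then a refinement stage.
    codebooks = [
        [[0, 0], [1, 1], [2, 2]],
        [[0, 0], [0.5, -0.5], [-0.5, 0.5]],
    ]
    codes = rvq_encode(codebooks, [1.4, 0.6])
    print(codes)                         # [1, 1]
    print(rvq_decode(codebooks, codes))  # [1.5, 0.5]
```

A language-model style TTS system predicts integer codes like these, and the codec's decoder turns them into a waveform, so no mel spectrogram vocoder is involved.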

jonatanklosko added the kind:feature New feature or request label Dec 13, 2023
