More TTS architectures #29

Sobsz · 2024-03-01T00:26:55Z

VITS via Piper (works)
~~XTTS-v2 via Coqui (works, but unstable)~~ removed due to being a pain and not worth the effort

~~streaming with StyleTTS~~ not natively supported, will be separate pull request for sentence-based streaming
~~streaming with VITS~~ technically supported but sentence-based, see above
~~streaming with XTTS (works, but not sure if it helps or if it's streaming)~~
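
For reference, the sentence-based streaming deferred to a separate pull request could look roughly like the sketch below. This is not code from this branch: `split_sentences`, `stream_tts`, and the `synthesize` callable are hypothetical names, and the regex splitter is a naive stand-in for a proper sentence tokenizer.

```python
import re
from typing import Callable, Iterator, List


def split_sentences(text: str) -> List[str]:
    """Naively split text after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def stream_tts(text: str, synthesize: Callable[[str], bytes]) -> Iterator[bytes]:
    """Yield one audio chunk per sentence instead of one chunk for the whole text.

    `synthesize` is a hypothetical stand-in for any non-streaming TTS call
    (for example a Piper or StyleTTS wrapper) that returns raw audio bytes.
    """
    for sentence in split_sentences(text):
        yield synthesize(sentence)
```

The caller can start playing the first chunk while later sentences are still being synthesized, which is what makes this "streaming" even when the model itself only produces whole utterances.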

matthewkennedy5 · 2024-03-15T18:20:38Z

docker-compose.yml

@@ -5,8 +5,8 @@ services:
    build: .
    command: >
      bash -c "python setup.py develop &&  \
-               mkdir -p models/styletts2  && \
-               aws s3 sync s3://uberduck-models-us-west-2/prototype/styletts2 models/styletts2 && \ 


it looks like this branch is a bit out of date with main, can you run:

git checkout more-tts-archs
git pull --rebase origin main
<resolve any merge conflicts>
git push -f

aye aye captain

matthewkennedy5 · 2024-03-15T18:21:17Z

openduck-py/openduck_py/routers/voice.py

@@ -350,6 +366,7 @@ def _check_for_exceptions(response_task: Optional[asyncio.Task]) -> bool:
            print("response task was cancelled")
        except Exception as e:
            print("response task raised an exception:", e)
+            print(traceback.format_exc(e))


except for the bit where it raises an exception of its own somehow :p
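
Some context on why the added line raises: as far as I can tell, the first parameter of `traceback.format_exc` is `limit`, not an exception, so passing `e` ends up comparing the exception against an integer inside the traceback module. A minimal sketch of two alternatives that should behave as intended (the helper names are hypothetical and not part of this PR):

```python
import traceback


def format_current_exception() -> str:
    """Format the exception currently being handled; call inside an except block."""
    # No arguments: the first positional parameter of format_exc is `limit`.
    return traceback.format_exc()


def format_exception_object(exc: BaseException) -> str:
    """Format a specific exception object, e.g. one retrieved from a finished task."""
    # The three-argument form works on older and newer Python 3 versions alike.
    return "".join(traceback.format_exception(type(exc), exc, exc.__traceback__))


try:
    raise ValueError("boom")
except Exception as e:
    current = format_current_exception()
    explicit = format_exception_object(e)
```

Either string can then be passed to `print` in place of `traceback.format_exc(e)`.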

matthewkennedy5 · 2024-03-15T18:24:13Z

openduck-py/openduck_py/voices/piper.py

+        speaker_id=0,
+    )
+    audio = b"".join(audio)
+    audio = torch.frombuffer(audio, dtype=torch.int16).float() / 32767  # TODO silly


should it be / 32768 (2^15), not 32767?

not a big difference though

also, what's # TODO silly?

it's silly because i'm undoing the conversion piper does (which uses 32767 btw)
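
To put numbers on the 32767 vs 32768 question: int16 samples span -32768 to 32767, so dividing by 32768 keeps every value in [-1, 1), while dividing by 32767 maps the largest positive sample to exactly 1.0 and lets -32768 land marginally below -1.0. A small stdlib-only sketch, assuming little-endian 16-bit PCM (the `pcm16_to_float` helper is hypothetical; the PR itself uses `torch.frombuffer`):

```python
import struct
from typing import List


def pcm16_to_float(pcm: bytes, scale: float) -> List[float]:
    """Decode little-endian 16-bit PCM bytes and divide each sample by `scale`."""
    count = len(pcm) // 2
    samples = struct.unpack(f"<{count}h", pcm[: count * 2])
    return [s / scale for s in samples]


# Extreme int16 values: most negative, zero, most positive.
raw = struct.pack("<3h", -32768, 0, 32767)

# Mirrors a producer that multiplied floats by 32767 before casting to int16.
by_32767 = pcm16_to_float(raw, 32767.0)
# Keeps every sample inside [-1.0, 1.0).
by_32768 = pcm16_to_float(raw, 32768.0)
```

Since the goal here is to undo a conversion that used 32767, matching that factor recovers the original floats more closely; the difference between the two scales is about 0.003%.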

matthewkennedy5 · 2024-03-15T18:24:55Z

openduck-py/openduck_py/voices/piper.py

+    )
+    audio = b"".join(audio)
+    audio = torch.frombuffer(audio, dtype=torch.int16).float() / 32767  # TODO silly
+    audio = resample(audio, model.config.sample_rate, output_sample_rate)


Can we skip this step if the input and output sample rates are the same? (I think they usually will be, if both are using 24000.)

torchaudio has an if-clause for that already https://pytorch.org/audio/stable/_modules/torchaudio/functional/functional.html#resample

Sobsz marked this pull request as ready for review March 14, 2024 23:54

matthewkennedy5 reviewed Mar 15, 2024

Sobsz added 6 commits March 15, 2024 18:42

d8e9701 add piper (tested), xtts, streaming
4853b21 actually commit the new files oops
3c9e594 save
e458a28 xtts should work as of this commit
5e72f61 but it's getting yeeted per zach
345e2fa black

Sobsz force-pushed the more-tts-archs branch from 09187a1 to 345e2fa Compare March 15, 2024 18:45

cac55fd black

matthewkennedy5 approved these changes Mar 15, 2024
