
Error: Adding speaker on Win 11 #72

Open
tin2tin opened this issue Feb 3, 2024 · 13 comments

Comments

@tin2tin

tin2tin commented Feb 3, 2024

Trying to add a speaker to the test file throws an error on Windows 11.

import numpy as np
from pydub import AudioSegment
from whisperspeech.pipeline import Pipeline

pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-small-en+pl.model')
# pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-tiny-en+pl.model')
# pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-base-en+pl.model')

audio_tensor = pipe.generate("""
 This is some sample text.  You would add text here that you want spoken and then only leave one of the above lines uncommented for the model you want to test.  Note that this script does not rely on the standard method within the whisperspeech pipeline.  Rather, it replaces a part of the functionality with reliance on pydub instead.  This approach "just worked."  Feel free to modify or distribute at your pleasure.
""", speaker='https://upload.wikimedia.org/wikipedia/commons/7/75/Winston_Churchill_-_Be_Ye_Men_of_Valour.ogg', lang='en', cps=14)

# generate uses CUDA if available; therefore, it's necessary to move to CPU before converting to NumPy array
audio_np = (audio_tensor.cpu().numpy() * 32767).astype(np.int16)

if len(audio_np.shape) == 1:
    audio_np = np.expand_dims(audio_np, axis=0)
else:
    audio_np = audio_np.T

print("Array shape:", audio_np.shape)
print("Array dtype:", audio_np.dtype)

try:
    audio_segment = AudioSegment(
        audio_np.tobytes(), 
        frame_rate=24000, 
        sample_width=2, 
        channels=1
    )
    audio_segment.export('output_audio.wav', format='wav')
    print("Audio file generated: output_audio.wav")
except Exception as e:
    print(f"Error writing audio file: {e}")
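For reference, the pydub step above only wraps 16-bit PCM samples in a WAV container, so Python's standard-library `wave` module could do the same without pydub (or ffmpeg). The sketch below assumes mono float samples in [-1, 1] at 24 kHz, matching the `frame_rate=24000`, `sample_width=2`, `channels=1` values the script passes to `AudioSegment`. It also clips samples before scaling, since `* 32767` followed by `astype(np.int16)` can overflow for values outside that range. `write_pcm16_wav` is a hypothetical helper, not part of whisperspeech.

```python
import sys
import wave
from array import array


def write_pcm16_wav(path, samples, sample_rate=24000):
    """Write mono float samples in [-1, 1] to a 16-bit PCM WAV file.

    Hypothetical helper, not part of whisperspeech. `samples` can be any
    iterable of floats, e.g. audio_tensor.cpu().numpy().ravel().
    """
    # Clip to [-1, 1] first so loud samples don't overflow the int16 range
    pcm = array("h", (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples))
    # WAV stores samples little-endian; array uses the machine's native order
    if sys.byteorder == "big":
        pcm.byteswap()
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)   # mono, like channels=1 in the script
        wav.setsampwidth(2)   # 2 bytes per sample = 16-bit, like sample_width=2
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())
```

With the script's variables, a call might look like `write_pcm16_wav('output_audio.wav', audio_tensor.cpu().numpy().ravel())`.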

And this is the error:

Traceback (most recent call last):
  File "...untitled_37.blend\Text.001", line 9, in <module>
  File "...\python\Lib\site-packages\whisperspeech\pipeline.py", line 87, in generate
    return self.vocoder.decode(self.generate_atoks(text, speaker, lang=lang, cps=cps, step_callback=step_callback))
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...\python\Lib\site-packages\whisperspeech\pipeline.py", line 80, in generate_atoks
    elif isinstance(speaker, (str, Path)): speaker = self.extract_spk_emb(speaker)
                                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...\python\Lib\site-packages\whisperspeech\pipeline.py", line 73, in extract_spk_emb
    samples, sr = torchaudio.load(fname)
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "...\python\Lib\site-packages\torchaudio\_backend\utils.py", line 205, in load
    return backend.load(uri, frame_offset, num_frames, normalize, channels_first, format, buffer_size)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...\python\Lib\site-packages\torchaudio\_backend\soundfile.py", line 27, in load
    return soundfile_backend.load(uri, frame_offset, num_frames, normalize, channels_first, format)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...\python\Lib\site-packages\torchaudio\_backend\soundfile_backend.py", line 221, in load
    with soundfile.SoundFile(filepath, "r") as file_:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...\python\Lib\site-packages\soundfile.py", line 740, in __init__
    self._file = self._open(file, mode_int, closefd)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...\python\Lib\site-packages\soundfile.py", line 1264, in _open
    _error_check(_snd.sf_error(file_ptr),
  File "...\python\Lib\site-packages\soundfile.py", line 1455, in _error_check
    raise RuntimeError(prefix + _ffi.string(err_str).decode('utf-8', 'replace'))
RuntimeError: Error opening 'https://upload.wikimedia.org/wikipedia/commons/7/75/Winston_Churchill_-_Be_Ye_Men_of_Valour.ogg': System error.

@tin2tin
Author

tin2tin commented Feb 3, 2024

Is it the .ogg format it can't read?

@BBC-Esq
Contributor

BBC-Esq commented Feb 3, 2024

This is the same type of error I received on Windows 10, if I recall correctly.

@tin2tin
Author

tin2tin commented Feb 3, 2024

If I change it to a WAV file, it works, but that wall of messages is still there.

@BBC-Esq
Copy link
Contributor

BBC-Esq commented Feb 3, 2024


Have you verified that the final audio file actually sounds like the speaker you specified?

@tin2tin
Author

tin2tin commented Feb 3, 2024

Yes, it does work. Here is Orson Welles first, and then Werner Herzog:

9739-10034.mp4

@BBC-Esq
Copy link
Contributor

BBC-Esq commented Feb 4, 2024

Cool, can you please post the script you used again? I still haven't gotten it working, for some strange reason.

@tin2tin
Author

tin2tin commented Feb 4, 2024

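from whisperspeech.pipeline import Pipeline

filename = 'output.wav'  # placeholder: output path
prompt = 'Text you want spoken.'  # placeholder: text to synthesize

pipe = Pipeline(s2a_ref='collabora/whisperspeech:s2a-q4-small-en+pl.model')
# speaker must be a local file path here; the URL version failed to load
pipe.generate_to_file(filename, prompt, speaker="path to local wav file", lang='en', cps=14)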

@jpc
Contributor

jpc commented Feb 13, 2024

I have no opinion on any particular library for sound files but I think:

  1. We want this to work out of the box on Windows and Linux/Mac
  2. We probably want to support loading arbitrary MP3/OGG files for the reference speech
  3. Loading directly from URLs is also nice for example code; maybe we can implement a simple wrapper ourselves on top of `requests` if `torchaudio` is not reliable on other platforms?
  4. It would be nice to have a single dependency for loading and saving files for inference (training can use a lot more stuff since it does not put a burden on every user)

Do we all agree that this would be a good list of requirements?
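Point 3 could be prototyped along these lines. The traceback above suggests libsndfile was handed the URL string as if it were a filename, so a minimal sketch, assuming that is the cause, is to download http(s) URLs into memory first and pass everything else through as a local path. It uses the standard library's `urllib.request` instead of `requests` to avoid an extra dependency; `fetch_speaker_audio` and its `fetch` parameter are hypothetical names, not whisperspeech's API.

```python
import io
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def _download(url, timeout=30):
    # Stdlib stand-in for requests.get(url).content
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def fetch_speaker_audio(speaker, fetch=_download):
    """Return something soundfile/torchaudio can open (hypothetical helper).

    http(s) URLs are downloaded into an in-memory file-like object;
    anything else is treated as a local path. Checking the scheme
    explicitly keeps Windows paths like C:\\audio\\x.wav on the path branch.
    """
    if isinstance(speaker, str) and urlparse(speaker).scheme in ("http", "https"):
        return io.BytesIO(fetch(speaker))
    return Path(speaker)
```

The returned `BytesIO` or `Path` could then be passed to `torchaudio.load(...)`, which accepts file-like objects as well as paths. Since the pipeline only routes `str`/`Path` speakers to `extract_spk_emb` (see the traceback), writing the downloaded bytes to a temporary file and passing that path would be another option.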

@BBC-Esq
Contributor

BBC-Esq commented Feb 13, 2024


My only thoughts concern points three and four above, regarding torchaudio.

I was originally the one getting errors when using torchaudio on Windows, which I think is the impetus for your message about possibly replacing torchaudio. To bring you up to speed: I eventually WAS able to resolve the torchaudio-related error we discussed. This required running pip install sounddevice. It's my understanding that torchaudio uses either sounddevice or sox_io under the hood, and for some reason I had to install sounddevice separately.

This fully resolved the torchaudio error. So, as long as it's made clear that Windows users (and possibly users on other platforms; I don't know) might have to install sounddevice explicitly, we shouldn't need to abandon torchaudio, unless other users encounter different problems.

Overall, I was NOT able to get a good tensor of a speaker's voice despite using very high-quality audio, and I haven't retried since. However, that's a separate issue and might involve my code. As far as torchaudio and Windows compatibility go, I hope this clarifies things.

@jpc
Contributor

jpc commented Feb 18, 2024

Hey, thanks for explaining this. So it looks like we should check whether we can add soundfile to our dependencies to make sure it gets installed on Windows.

@BBC-Esq
Contributor

BBC-Esq commented Feb 18, 2024

I would recommend adding it for Linux and macOS users as well. Apparently @signalprime encountered an error stating that no torchaudio backend was available, which was solved by running pip install soundfile. I believe his pull request might already do this.

My comment above was half misinformed: I think I confused "soundfile" and "sounddevice" because the names are similar. To clarify, I had to separately pip install soundfile, not sounddevice, to get the script to work.

As long as soundfile works on Windows, macOS, and Linux, which it appears to, I'd recommend setting the default backend to soundfile instead of letting torchaudio select one automatically:

https://pytorch.org/audio/2.0.0/backend.html

My comments in @signalprime's pull request discuss this, but basically, soundfile is supposed to work on all three platforms, whereas sox_io is Linux/macOS only. I researched the issue, and this is a remnant of early PyTorch versions. I believe sox_io is hosted on SourceForge somewhere; then soundfile came along and was a huge improvement, but for some reason sox_io is still an option. Soundfile is updated frequently, whereas I can't find any information on sox_io, even on SourceForge. Furthermore, torchaudio's GitHub confusingly lists the "optional" dependencies as kaldi and soundfile, not sox_io and soundfile as in the installation instructions on their website.

https://github.com/pytorch/audio/blob/main/requirements.txt

Anyway, to avoid the errors @signalprime and I ran into, I'd recommend explicitly setting soundfile as the default torchaudio backend somewhere in the source code.
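A minimal sketch of making that choice explicit, assuming torchaudio 2.1 or later (the `torchaudio\_backend\utils.py` dispatcher in the traceback above suggests a version from that line). In those versions `torchaudio.load` accepts a `backend=` argument and `torchaudio.list_audio_backends()` reports what is installed, while the global `torchaudio.set_audio_backend` switch described on the 2.0 page linked above is deprecated. The helper names and the fallback order after soundfile are illustrative assumptions.

```python
PREFERRED_BACKENDS = ("soundfile", "ffmpeg", "sox")  # assumed preference order


def choose_backend(available, preferred=PREFERRED_BACKENDS):
    """Pick the first preferred torchaudio backend that is actually installed."""
    for name in preferred:
        if name in available:
            return name
    raise RuntimeError(
        "No torchaudio I/O backend found; try `pip install soundfile` "
        f"(available: {list(available)})"
    )


def load_audio(path):
    """Load audio with an explicit backend instead of torchaudio's automatic pick.

    Assumes torchaudio >= 2.1, where torchaudio.load accepts `backend=`.
    """
    import torchaudio  # imported lazily so choose_backend works without torchaudio

    backend = choose_backend(torchaudio.list_audio_backends())
    return torchaudio.load(path, backend=backend)
```

Preferring soundfile but falling back to another installed backend is one design choice; raising as soon as soundfile is missing would enforce a single I/O dependency more strictly.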

If you agree, here are the installation instructions:

https://python-soundfile.readthedocs.io/en/latest/

Note that they're somewhat different for Linux, so you may need to add an additional dependency for Linux users named libsndfile.


@BBC-Esq
Contributor

BBC-Esq commented Feb 19, 2024

Finally found the website for SoX: apparently it hasn't been updated since 2015, so I'm not sure why in the world PyTorch would use it. I say set the default to soundfile... :-)

https://sourceforge.net/projects/sox/files/sox/

@jpc
Contributor

jpc commented Feb 19, 2024

Hey, nice job finding all that info. I was not aware of the history behind the torchaudio backends!

I agree with your conclusion - I’ll test a fresh Linux install and if you can confirm that torchaudio with soundfile works well on Windows then we can use this exclusively.

I also implemented a workaround to support downloading the file from a URL, even without Sox.
