
Diarization pipeline v3.1 is much slower than 3.0 when running on CPU #1621

a-rogalska opened this issue Jan 17, 2024 · 22 comments

a-rogalska commented Jan 17, 2024

Tested versions

Tested on 3.1 vs 3.0

System information

Debian GNU/Linux, torch 2.1.2

Issue description

When running the diarization pipeline on CPU, v3.1 is more than 2x slower than v3.0. Is it possible to make it faster?

Minimal reproduction example (MRE)

from pyannote.audio import Pipeline
import time

hf_token = "..."  # your Hugging Face access token

# time the v3.0 pipeline
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.0", use_auth_token=hf_token)
start = time.perf_counter()
diarization = pipeline("sample.wav")
print("\nDiarization on v3.0 took {:.2f} s\n".format(time.perf_counter() - start))

# time the v3.1 pipeline on the same file
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1", use_auth_token=hf_token)
start = time.perf_counter()
diarization = pipeline("sample.wav")
print("\nDiarization on v3.1 took {:.2f} s\n".format(time.perf_counter() - start))
ljnicol commented Jan 18, 2024

I'm also having this issue. Using the above code I get:

Diarization on v3.0 took 12.75 s

Diarization on v3.1 took 22.36 s

System Information

This is on a MacBook Pro M1, running on the CPU. torch 2.1.2.

hbredin (Member) commented Jan 18, 2024

Would you mind sharing a Google Colab that I can just click and run?

a-rogalska (Author) commented

Here is a Colab link.

For a 2-minute audio file it took 115.84 s on v3.0.1 and 559.31 s on v3.1.0.

hbredin (Member) commented Jan 19, 2024

Thanks for taking the time to prepare a notebook. That helps.

  1. Looks like you did not provide the sample audio file, so one cannot reproduce the example. You could share it (or another one) online and use !wget url_to_that_file.wav directly in the notebook to make it self-contained.

  2. To get a better idea of where time is spent, you can wrap the call to the pipeline with a progress hook, a timing hook, or both:

from pyannote.audio.pipelines.utils.hook import Hooks, ProgressHook, TimingHook

file = {"audio": ...}

# progress hook alone (shows a progress bar)
with ProgressHook() as hook:
    diarization = pipeline(file, hook=hook)

# timing hook alone (adds a "timing" key to file)
with TimingHook() as hook:
    diarization = pipeline(file, hook=hook)

# both combined
with Hooks(ProgressHook(), TimingHook()) as hook:
    diarization = pipeline(file, hook=hook)
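
After a run with TimingHook, the per-step durations can be read back from the file dictionary:

# per-step durations, stored by TimingHook under the "timing" key it adds to file
print(file["timing"])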

a-rogalska (Author) commented

Thanks for the hint. I updated the notebook with the sample audio from the tutorials and the hooks. According to the timing hook, the embedding step takes much longer in the new version.

hbredin (Member) commented Jan 23, 2024

Thanks to the completed MRE, I can now reproduce the issue.

The main difference between 3.0 and 3.1 is the switch from ONNX to PyTorch inference.

On GPU: PyTorch is faster than ONNX.
On CPU: ONNX is faster than PyTorch.

Could anyone using pyannote in production on CPU chime in?
Any idea on how to make PyTorch CPU inference faster?
I'd like to avoid going back to ONNX, as it was apparently painful for GPU users.
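
For anyone experimenting, a couple of generic PyTorch CPU settings are worth checking first; this is a minimal sketch of standard knobs (thread counts and inference mode), not a pyannote-specific fix:

import torch

# generic PyTorch CPU inference settings (not pyannote-specific)
torch.set_num_threads(8)          # intra-op parallelism; match your physical core count
torch.set_num_interop_threads(1)  # avoid thread oversubscription between operators

with torch.inference_mode():      # skips autograd bookkeeping during inference
    diarization = pipeline("sample.wav")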

askiefer commented

Any update on this, @hbredin? 🙏

hbredin (Member) commented Jan 25, 2024

No update... hence the help wanted tag ;-)
Hopefully one of the many users of pyannote will chime in.

mengjie-du commented

It has been noticed that the 3.1 pipeline's efficiency suffers from speaker embedding inference. With the default config, every 10 s chunk has to go through the embedding model three times. Separating the embedding pipeline into the ResNet backbone and the masked pooling proves effective: with this modification, every chunk is inferred only once through the backbone, bringing an almost 3x speedup in my experiment. Furthermore, a cached-inference strategy helps a lot as well, given the default 90% overlap between consecutive chunks.
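
For illustration only, a minimal sketch of the split described above; backbone, pooling, and the argument shapes are hypothetical stand-ins, not pyannote's actual internals:

import torch

def embed_chunk(backbone, pooling, waveform, masks):
    # waveform: audio for one 10 s chunk; masks: one frame-level mask per speaker
    with torch.inference_mode():
        features = backbone(waveform)  # expensive ResNet pass, now run once per chunk
        # masked pooling is cheap, so repeating it per speaker costs little
        return torch.stack([pooling(features, mask) for mask in masks])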

marrrcin commented

I think the main problem lies in:

waveform, _ = self._audio.crop(
    file,
    chunk,
    duration=duration,
    mode="pad",
)

It seems that for longer files, the .crop call takes much longer than embedding the chunk (no matter whether it's CPU or CUDA). The easiest way to reproduce it is to run version 3.1.1 on a WAV file that is ~1 h long. It's basically unusable for long audio files.

hbredin (Member) commented Feb 14, 2024

@marrrcin, these are two different problems.
Your problem can be solved by loading the file into memory first.
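
For reference, a pyannote pipeline accepts a preloaded waveform in place of a path; a minimal sketch ("long_audio.wav" is a placeholder):

import torchaudio

# decode the file once up front so each overlapping chunk is cropped from the
# in-memory tensor instead of re-reading and re-decoding the file on disk
waveform, sample_rate = torchaudio.load("long_audio.wav")
diarization = pipeline({"waveform": waveform, "sample_rate": sample_rate})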

marrrcin commented

Thanks @hbredin, loading into memory really helped - with that, the performance is tolerable and a 1 h file finishes within a few minutes (<5 min on GPU).

hbredin (Member) commented Feb 20, 2024

Happy that your problem is solved and that you "tolerate" the performance of pyannote (that you use for free, by the way).

kenplusplus commented Apr 7, 2024

> Thanks @hbredin, loading into memory really helped - with that, the performance is tolerable and a 1 h file finishes within a few minutes (<5 min on GPU).

Thanks for sharing, @marrrcin. Have you tested on CPU?

marrrcin commented Apr 7, 2024

> Thanks for sharing, @marrrcin. Have you tested on CPU?

No, I was running it on a GPU.

kenplusplus commented

I have tested diarization pipeline v3.0 on CPU and also found its latency is lower than v3.1's (down from 50 s to 30 s).

JuergenFleiss commented

Just to chime in with a CPU comparison between 3.0 and 3.1, without loading the file into memory.

The difference is massive for longer files. For a 22-minute file on a Ryzen 6850U:

  • 27 minutes for the embeddings in 3.1
  • 2 minutes 40 seconds for the embeddings in 3.0

We observed similarly long embedding times on M1 and Intel.

JuergenFleiss commented

@hbredin Just tried out pyannote 1.2 and embeddings are much faster again on CPU. Did you change something in this regard?

Again, a 22-minute file on a Ryzen 6850U:

  • 1 minute 48 seconds in 3.2
  • 27 minutes for the embeddings in 3.1
  • 2 minutes 40 seconds for the embeddings in 3.0

hbredin (Member) commented May 8, 2024

I did not. But happy that the problem is solved.

JuergenFleiss commented

> I did not. But happy that the problem is solved.

Maybe it was the torch update...

rmchale commented May 23, 2024

@JuergenFleiss

> Just tried out pyannote 1.2 and embeddings are much faster again on CPU. Did you change something in this regard?

1.2 of pyannote? From here? https://pypi.org/project/pyannote.audio/#history

JuergenFleiss commented

> > Just tried out pyannote 1.2 and embeddings are much faster again on CPU. Did you change something in this regard?
>
> 1.2 of pyannote? From here? https://pypi.org/project/pyannote.audio/#history

Should of course have been 3.2.
