
Poor output quality and speed when using batched inference #954

Closed
ben256 opened this issue Aug 7, 2024 · 2 comments

ben256 commented Aug 7, 2024

I'm doing some work on developing an effective solution for code-switching audio, specifically between Arabic and English. I have a fine-tuned Whisper (large-v2) model which gives accurate Arabic/English output, and this has been working fine when used with Faster-Whisper. But recently I've been looking at batched inference and have been experiencing some strange issues.

Transcription Quality

This has dramatically decreased when using batched inference. The single inference output is almost perfect, and the same as when using the original transformers version of the model, but the batched version suffers from hallucinations and is completely unusable.

Single Inference:

>> [0.88s -> 23.16s] Hi عمر Hi how are you? I'm doing well how are you? I'm excited صراحة for this episode و أنا كمان okay عمر so أنا usually بحب أخلي my guests يعرفوا عن حالهم فأطلق العنان sure and اسمي عمر شكري I'm a singer from Jordan I've been singing my whole life بس professionally in Jordan for the past five years
>> [23.16s -> 25.98s] هو يعني شوية لك؟

Batched Inference:

>> [1.20s -> 26.10s] Hi عمر مرحبا كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟

It is worth noting that the transcription quality gets better when setting use_vad_model=False while creating the BatchedInferencePipeline. The actual text is fairly accurate to the audio, but it still doesn't offer the language granularity which is needed for a code-switching transcript and which is present in the single inference output.

Batched Inference (use_vad_model=False):

>> [0.80s -> 24.30s] Hi عمر مرحبا كيف حالك؟ انا بخير كيف حالك؟ انا متحمس صراحة من هذا الفيديو و انا كمان okay عمر انا عادى بحب اخلى مهاراتي يعرفوا عن حالهم فاطلخ العنان sure انا اسمي عمر شكري I'm a singer from Jordan I've been singing my whole life بس professionally in Jordan for the past five years oh yeah

Transcription Speed

Tracking the processes with CUDA events shows batched inference taking significantly longer than single inference.

Batched time (s): 0.466946044921875
Single time (s): 0.004865024089813232

Having said that, I imagine this is due to my test audio only being 30 seconds long, so as the audio length increases the batched inference will be increasingly efficient (though correct me if I'm wrong).
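
For context, a minimal sketch of this kind of CUDA-event timing (the use of torch.cuda.Event and the helper shown here are assumptions, not taken from the issue):

```python
import torch

def timed_transcribe(transcribe_fn):
    """Time a transcription call using CUDA events; returns (seconds, segments)."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    segments, _info = transcribe_fn()
    # transcribe() returns a lazy generator, so decoding only runs when the
    # segments are consumed; list() forces that work inside the timed region.
    segments = list(segments)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / 1000.0, segments  # elapsed_time() is in ms
```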

Note

These tests have all been carried out with only 3 parameters set: beam_size=5, language="ar", word_timestamps=True, using the methods defined in the README, with a batch size of 16 for the batched inference.
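
For reference, a rough sketch of the two setups being compared, following README-style usage; the model/device settings and audio filename are placeholders, and use_vad_model is the constructor flag referenced above (its exact signature may differ between versions):

```python
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("large-v2", device="cuda", compute_type="float16")

# Single (sequential) inference
segments, info = model.transcribe(
    "test_audio.wav", beam_size=5, language="ar", word_timestamps=True
)

# Batched inference with batch size 16; use_vad_model is the flag toggled above
batched = BatchedInferencePipeline(model=model, use_vad_model=True)
batched_segments, batched_info = batched.transcribe(
    "test_audio.wav", batch_size=16, beam_size=5, language="ar", word_timestamps=True
)
```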


So I think I'd just like to know how I can match the single inference transcription quality while using batched inference. I'm currently using a forked version of WhisperX for transcription, which is where the batched inference comes from. I am building a custom solution on top of faster-whisper, so I could always switch to single inference, but I don't think that makes sense given the apparent performance benefits and the infrastructure I have built around batched inference in my custom WhisperX.

And:

  • Why does the VAD model make the performance so much worse?
  • Even without it, what is causing the difference between batched and single inference?
  • There are some hard-coded parameters in the BatchedInferencePipeline, such as condition_on_previous_text or prompt_reset_on_temperature; maybe these have an effect?
  • What other parameters can I change to try and replicate the single inference performance?
  • I haven't looked into it, but could I use sampling to try and improve the output?
ben256 changed the title from "Poor output quality and speed when using batching" to "Poor output quality and speed when using batched inference" on Aug 7, 2024
MahmoudAshraf97 (Collaborator) commented:

can you try #936 and see if it solves any of the problems?


ben256 commented Aug 8, 2024

Ah yep, pulled your fork and it works much better, thanks. Set the params as in the PR to without_timestamps=True, vad_filter=True, chunk_length=25, and got this as an output:

>> [0.94s -> 24.18s] Hi عمر. Hi. How are you? I'm doing well, how are you? I'm excited صراحة for this episode. وأنا كمان. Okay عمر. So أنا usually بحب أخلي my guests يعرفوا عن حالهم فأطلق العنان. Sure. أنا اسمي عمر شكري. I'm a singer from Jordan. I've been singing my whole life بس professionally in Jordan for the past five years. Oh yeah.
>> [24.88s -> 26.06s] شويه لك؟
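
For anyone finding this later, a minimal sketch of that call, assuming the transcribe signature from the #936 branch and a BatchedInferencePipeline instance named batched as in the earlier sketch (the audio filename is a placeholder):

```python
# Parameters as set above; batch_size kept at 16 from the earlier tests.
segments, info = batched.transcribe(
    "test_audio.wav",
    batch_size=16,
    without_timestamps=True,  # decode without in-segment timestamp tokens
    vad_filter=True,          # use the VAD filter to segment the audio
    chunk_length=25,          # 25-second chunks instead of the default 30
)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```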

I would say it's actually better than the sequential inference, as the sentence tokenization is much better. I guess that's the VAD model...? I've found that having more sentences improves diarization performance, so thank you! I'll have a play around with the VAD params on some other test audio, but it's looking good so far!
