I'm doing some work on developing an effective solution for code-switching audio, specifically between Arabic and English. I have a fine-tuned Whisper (large-v2) model which gives accurate Arabic/English output, and this has been working fine when used with Faster-Whisper. But recently I've been looking at batched inference and have been experiencing some strange issues.
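For context, the two paths I'm comparing look roughly like this (a minimal sketch following the README-style API; the model path and audio file name are placeholders):

```python
from faster_whisper import WhisperModel, BatchedInferencePipeline

# Fine-tuned large-v2 checkpoint converted for faster-whisper (placeholder path)
model = WhisperModel("path/to/finetuned-large-v2", device="cuda", compute_type="float16")

# Single (sequential) inference
segments, info = model.transcribe("test_audio.wav", beam_size=5,
                                  language="ar", word_timestamps=True)

# Batched inference
batched_model = BatchedInferencePipeline(model=model)
segments, info = batched_model.transcribe("test_audio.wav", beam_size=5,
                                          language="ar", word_timestamps=True,
                                          batch_size=16)
```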
Transcription Quality
Transcription quality drops dramatically when using batched inference. The single-inference output is almost perfect, and the same as when using the original transformers version of the model, but the batched version suffers from hallucinations and is completely unusable.
Single Inference:
>> [0.88s -> 23.16s] Hi عمر Hi how are you? I'm doing well how are you? I'm excited صراحة for this episode و أنا كمان okay عمر so أنا usually بحب أخلي my guests يعرفوا عن حالهم فأطلق العنان sure and اسمي عمر شكري I'm a singer from Jordan I've been singing my whole life بس professionally in Jordan for the past five years
>> [23.16s -> 25.98s] هو يعني شوية لك؟
Batched Inference:
>> [1.20s -> 26.10s] Hi عمر مرحبا كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟
It is worth noting the transcription quality gets better by setting use_vad_model=False when creating the BatchedInferencePipeline. The actual text is fairly accurate to the audio, but it still doesn't offer the language granularity which is needed for a code-switching transcript and is present in the single-inference output.
Batched Inference (use_vad_model=False):
>> [0.80s -> 24.30s] Hi عمر مرحبا كيف حالك؟ انا بخير كيف حالك؟ انا متحمس صراحة من هذا الفيديو و انا كمان okay عمر انا عادى بحب اخلى مهاراتي يعرفوا عن حالهم فاطلخ العنان sure انا اسمي عمر شكري I'm a singer from Jordan I've been singing my whole life بس professionally in Jordan for the past five years oh yeah
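For reference, the use_vad_model=False run above is set up roughly like this (a sketch; use_vad_model is a constructor argument on BatchedInferencePipeline in the build I'm testing, and model is the same WhisperModel as in the sketch further up):

```python
# Same model as above, but with the pipeline's internal VAD model disabled
batched_model = BatchedInferencePipeline(model=model, use_vad_model=False)

segments, info = batched_model.transcribe("test_audio.wav", beam_size=5,
                                          language="ar", word_timestamps=True,
                                          batch_size=16)
```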
Transcription Speed
Timing the runs with CUDA events shows batched inference taking significantly longer than single inference.
Batched time (s): 0.466946044921875
Single time (s): 0.004865024089813232
Having said that, I imagine this is because my test audio is only 30 seconds long, so as the audio length increases batched inference should become increasingly efficient (though correct me if I'm wrong).
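For reference, the timing harness looks roughly like this (a sketch assuming torch.cuda.Event is used for the CUDA events, with model and batched_model constructed as in the sketch further up):

```python
import torch

def time_transcribe(fn):
    """Time a transcription call between two CUDA events; returns (segments, seconds)."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    segments, info = fn()
    # transcribe() returns a lazy generator, so consume the segments inside the
    # timed region to make sure the decoding work itself is being measured.
    segments = list(segments)
    end.record()
    torch.cuda.synchronize()
    return segments, start.elapsed_time(end) / 1000.0  # elapsed_time() is in ms

_, batched_s = time_transcribe(
    lambda: batched_model.transcribe("test_audio.wav", beam_size=5, language="ar",
                                     word_timestamps=True, batch_size=16))
_, single_s = time_transcribe(
    lambda: model.transcribe("test_audio.wav", beam_size=5, language="ar",
                             word_timestamps=True))
print(f"Batched time (s): {batched_s}")
print(f"Single time (s): {single_s}")
```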
Note
These tests have all been carried out with only three parameters set: beam_size=5, language="ar", word_timestamps=True, using the methods defined in the README, with a batch size of 16 for the batched inference.
So I'd like to know how I can match the single-inference transcription quality while using batched inference. I'm currently using a forked version of WhisperX for transcription, which is where the batched inference comes from. I am building a custom solution on top of faster-whisper so I could always switch to single inference, but I don't think that makes sense given the apparent performance benefits and the infrastructure I have built around batched inference in my custom WhisperX.
And:
Why does the VAD model make the performance so much worse?
Even without it what is causing the difference between batched and single inference?
There are some fixed parameters in the BatchedInferencePipeline, such as condition_on_previous_text or prompt_reset_on_temperature; maybe these have an effect?
What other parameters can I change to try and replicate the single inference performance?
I haven't looked into it but can I use sampling to try and improve the output?
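To make the last two questions concrete, this is the kind of experiment I have in mind on the sequential transcribe() (a sketch; whether the batched pipeline in this version exposes the same knobs is part of what I'm asking):

```python
# Decoding knobs on the sequential transcribe() that might be worth sweeping
segments, info = model.transcribe(
    "test_audio.wav",
    language="ar",
    word_timestamps=True,
    beam_size=5,
    best_of=5,                          # number of sampling candidates when temperature > 0
    temperature=[0.0, 0.2, 0.4, 0.6],   # fallback temperatures enable sampling
    condition_on_previous_text=False,   # often reduces repetition/hallucination loops
    prompt_reset_on_temperature=0.5,
)
```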
ben256 changed the title from "Poor output quality and speed when using batching" to "Poor output quality and speed when using batched inference" on Aug 7, 2024.
Ah yep, pulled your fork and it works much better, thanks. Set the params as in the PR to without_timestamps=True, vad_filter=True, chunk_length=25, and got this as an output:
>> [0.94s -> 24.18s] Hi عمر. Hi. How are you? I'm doing well, how are you? I'm excited صراحة for this episode. وأنا كمان. Okay عمر. So أنا usually بحب أخلي my guests يعرفوا عن حالهم فأطلق العنان. Sure. أنا اسمي عمر شكري. I'm a singer from Jordan. I've been singing my whole life بس professionally in Jordan for the past five years. Oh yeah.
>> [24.88s -> 26.06s] شويه لك؟
I would say it's actually better than the sequential inference, as the sentence tokenization is much better; I guess that's the VAD model...? And I've found that having more sentences improves diarization performance, so thank you! Will have a play around with the VAD params on some other test audio, but it's looking good so far!
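For anyone else landing here, the call that produced the output above looks roughly like this (a sketch; the exact signature depends on the PR branch, and the model path/audio file are placeholders):

```python
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("path/to/finetuned-large-v2", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

segments, info = batched_model.transcribe("test_audio.wav",
                                          beam_size=5, language="ar",
                                          batch_size=16,
                                          without_timestamps=True,
                                          vad_filter=True,
                                          chunk_length=25)
```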