I'm doing some work on developing an effective solution for code-switching audio, specifically between Arabic and English. I have a fine-tuned Whisper (large-v2) model which gives accurate Arabic/English output, and this has been working fine when used with Faster-Whisper. But recently I've been looking at batched inference and have been experiencing some strange issues.
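For context, the two paths I'm comparing look roughly like this (a minimal sketch following the README-style API; the model path and audio file name are placeholders):

```python
from faster_whisper import WhisperModel, BatchedInferencePipeline

# Fine-tuned large-v2 checkpoint converted for faster-whisper (placeholder path)
model = WhisperModel("path/to/finetuned-large-v2", device="cuda", compute_type="float16")

# Single (sequential) inference
segments, info = model.transcribe("test_audio.wav", beam_size=5,
                                  language="ar", word_timestamps=True)

# Batched inference
batched_model = BatchedInferencePipeline(model=model)
segments, info = batched_model.transcribe("test_audio.wav", beam_size=5,
                                          language="ar", word_timestamps=True,
                                          batch_size=16)
```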
Transcription Quality
Transcription quality drops dramatically when using batched inference. The single-inference output is almost perfect, and the same as when using the original transformers version of the model, but the batched version suffers from hallucinations and is completely unusable.
Single Inference:
>> [0.88s -> 23.16s] Hi عمر Hi how are you? I'm doing well how are you? I'm excited صراحة for this episode و أنا كمان okay عمر so أنا usually بحب أخلي my guests يعرفوا عن حالهم فأطلق العنان sure and اسمي عمر شكري I'm a singer from Jordan I've been singing my whole life بس professionally in Jordan for the past five years
>> [23.16s -> 25.98s] هو يعني شوية لك؟
Batched Inference:
>> [1.20s -> 26.10s] Hi عمر مرحبا كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟ كيف حالك؟
It is worth noting the transcription quality gets better by setting use_vad_model=False when creating the BatchedInferencePipeline. The actual text is fairly accurate to the audio, but it still doesn't offer the language granularity which is needed for a code-switching transcript and is present in the single-inference output.
Batched Inference (use_vad_model=False):
>> [0.80s -> 24.30s] Hi عمر مرحبا كيف حالك؟ انا بخير كيف حالك؟ انا متحمس صراحة من هذا الفيديو و انا كمان okay عمر انا عادى بحب اخلى مهاراتي يعرفوا عن حالهم فاطلخ العنان sure انا اسمي عمر شكري I'm a singer from Jordan I've been singing my whole life بس professionally in Jordan for the past five years oh yeah
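For reference, the use_vad_model=False run above is set up roughly like this (a sketch; use_vad_model is a constructor argument on BatchedInferencePipeline in the build I'm testing, and model is the same WhisperModel as in the sketch further up):

```python
# Same model as above, but with the pipeline's internal VAD model disabled
batched_model = BatchedInferencePipeline(model=model, use_vad_model=False)

segments, info = batched_model.transcribe("test_audio.wav", beam_size=5,
                                          language="ar", word_timestamps=True,
                                          batch_size=16)
```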
Transcription Speed
Timing the runs with CUDA events shows batched inference taking significantly longer than single inference.
Batched time (s): 0.466946044921875
Single time (s): 0.004865024089813232
Having said that, I imagine this is because my test audio is only 30 seconds long, so as the audio length increases batched inference should become increasingly efficient (though correct me if I'm wrong).
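For reference, the timing harness looks roughly like this (a sketch assuming torch.cuda.Event is used for the CUDA events, with model and batched_model constructed as in the sketch further up):

```python
import torch

def time_transcribe(fn):
    """Time a transcription call between two CUDA events; returns (segments, seconds)."""
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    segments, info = fn()
    # transcribe() returns a lazy generator, so consume the segments inside the
    # timed region to make sure the decoding work itself is being measured.
    segments = list(segments)
    end.record()
    torch.cuda.synchronize()
    return segments, start.elapsed_time(end) / 1000.0  # elapsed_time() is in ms

_, batched_s = time_transcribe(
    lambda: batched_model.transcribe("test_audio.wav", beam_size=5, language="ar",
                                     word_timestamps=True, batch_size=16))
_, single_s = time_transcribe(
    lambda: model.transcribe("test_audio.wav", beam_size=5, language="ar",
                             word_timestamps=True))
print(f"Batched time (s): {batched_s}")
print(f"Single time (s): {single_s}")
```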
Note
These tests have all been carried out with only three parameters set: beam_size=5, language="ar", word_timestamps=True, using the methods defined in the README, with a batch size of 16 for the batched inference.
So I'd like to know how I can match the single-inference transcription quality while using batched inference. I'm currently using a forked version of WhisperX for transcription, which is where the batched inference comes from. I am building a custom solution on top of faster-whisper so I could always switch to single inference, but I don't think that makes sense given the apparent performance benefits and the infrastructure I have built around batched inference in my custom WhisperX.
And:
Why does the VAD model make the performance so much worse?
Even without it what is causing the difference between batched and single inference?
There are some fixed parameters in the BatchedInferencePipeline, such as condition_on_previous_text or prompt_reset_on_temperature; maybe these have an effect?
What other parameters can I change to try and replicate the single inference performance?
I haven't looked into it but can I use sampling to try and improve the output?
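To make the last two questions concrete, this is the kind of experiment I have in mind on the sequential transcribe() (a sketch; whether the batched pipeline in this version exposes the same knobs is part of what I'm asking):

```python
# Decoding knobs on the sequential transcribe() that might be worth sweeping
segments, info = model.transcribe(
    "test_audio.wav",
    language="ar",
    word_timestamps=True,
    beam_size=5,
    best_of=5,                          # number of sampling candidates when temperature > 0
    temperature=[0.0, 0.2, 0.4, 0.6],   # fallback temperatures enable sampling
    condition_on_previous_text=False,   # often reduces repetition/hallucination loops
    prompt_reset_on_temperature=0.5,
)
```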
ben256 changed the title from "Poor output quality and speed when using batching" to "Poor output quality and speed when using batched inference" on Aug 7, 2024.
Ah yep, pulled your fork and it works much better, thanks. Set the params as in the PR to without_timestamps=True, vad_filter=True, chunk_length=25, and got this as an output:
>> [0.94s -> 24.18s] Hi عمر. Hi. How are you? I'm doing well, how are you? I'm excited صراحة for this episode. وأنا كمان. Okay عمر. So أنا usually بحب أخلي my guests يعرفوا عن حالهم فأطلق العنان. Sure. أنا اسمي عمر شكري. I'm a singer from Jordan. I've been singing my whole life بس professionally in Jordan for the past five years. Oh yeah.
>> [24.88s -> 26.06s] شويه لك؟
I would say it's actually better than the sequential inference, as the sentence tokenization is much better; I guess that's the VAD model...? And I've found that having more sentences improves diarization performance, so thank you! Will have a play around with the VAD params on some other test audio, but it's looking good so far!
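For anyone else landing here, the call that produced the output above looks roughly like this (a sketch; the exact signature depends on the PR branch, and the model path/audio file are placeholders):

```python
from faster_whisper import WhisperModel, BatchedInferencePipeline

model = WhisperModel("path/to/finetuned-large-v2", device="cuda", compute_type="float16")
batched_model = BatchedInferencePipeline(model=model)

segments, info = batched_model.transcribe("test_audio.wav",
                                          beam_size=5, language="ar",
                                          batch_size=16,
                                          without_timestamps=True,
                                          vad_filter=True,
                                          chunk_length=25)
```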