
audio-to-text pipeline fails on return_timestamps=word #390

Open
ad-astra-video opened this issue Jan 3, 2025 · 1 comment
Labels
bug Something isn't working


ad-astra-video commented Jan 3, 2025

Describe the bug

The audio-to-text pipeline is not returning word-level timestamps.

@RUFFY-369 is there a way to switch to sdpa when word-level timestamps are requested, without reloading the pipeline onto the GPU?


Reproduction steps

  1. Download new audio-to-text pipeline with flash attention 2 enabled
  2. Send request to pipeline including return_timestamps=word
    curl -X POST http://172.17.0.1:6666/audio-to-text -F "[email protected]" -F "model_id=openai/whisper-large-v3" -F "return_timestamps=word"
  3. See error returned
    {"error":{"message":": Error during model execution: WhisperFlashAttention2 attention does not support output_attentions."}}
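The failure happens because Whisper's word-level timestamps are derived from cross-attention weights (output_attentions), which the FlashAttention2 kernels do not return. One way to surface this earlier is a request guard in the route handler that rejects the combination with a clear message instead of failing deep inside model execution. This is a hypothetical sketch (the function name and signature are not from the codebase):

```python
from typing import Optional

def validate_timestamp_request(
    attn_implementation: str, return_timestamps: Optional[str]
) -> Optional[str]:
    """Return an error message if the request cannot be served, else None.

    Word-level timestamps require output_attentions, which the
    flash_attention_2 kernels cannot return; sdpa and eager can.
    """
    if return_timestamps == "word" and attn_implementation == "flash_attention_2":
        return (
            "word-level timestamps require output_attentions, which is not "
            "supported by flash_attention_2; load the model with sdpa or eager"
        )
    return None
```

A guard like this would only paper over the error; the sections below discuss actually serving the request.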

Expected behaviour

Return word level timestamps.

Severity

None

Screenshots / Live demo link

No response

OS

None

Running on

None

AI-worker version

No response

Additional context

No response

@ad-astra-video ad-astra-video added the bug Something isn't working label Jan 3, 2025

RUFFY-369 commented Jan 6, 2025


@ad-astra-video We can't change the attention implementation in __call__ without reinitializing the pipeline, because pipeline initialization initializes the model, and that is where the WhisperEncoderLayer and WhisperDecoderLayer get their self_attn modules initialized. So even if you change attn_implementation in __call__, you would still need to swap out the attention layers in those modules directly, which just means reinitializing the model itself. Also, for testing's sake I tried to change the attention implementation dynamically in __call__ via self.tm.model.config._attn_implementation, and it didn't work.
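Since the attention implementation is fixed when the model is constructed, the decision has to be made once, at load time. A minimal sketch of that decision as a standalone helper (the function name and flags are hypothetical, not from the codebase; the returned string would be what gets passed as the attn_implementation model kwarg when loading the model):

```python
def choose_attn_implementation(need_word_timestamps: bool, has_flash_attn: bool) -> str:
    """Pick the attention implementation once, at model load time.

    sdpa can return attention weights (needed for word-level timestamps);
    flash_attention_2 is faster but cannot return them.
    """
    if need_word_timestamps:
        return "sdpa"
    return "flash_attention_2" if has_flash_attn else "sdpa"
```

The trade-off is baked in at startup: a worker loaded this way either serves word-level timestamps or gets the FlashAttention2 speedup, not both.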

If we really wanted to switch attn_implementation to sdpa at runtime, we would have to replace the attention layers in the encoder's and decoder's attention modules via model.named_modules(), which amounts to almost the same work as reinitializing the model.

An alternative solution would be to keep an extra model instance (loaded with sdpa) for word-level timestamp requests, without reinitializing the main pipeline.
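The dual-instance alternative could be dispatched roughly as below. This is a structural sketch only: the class name is hypothetical, and the loader callables stand in for the real pipeline construction so the routing logic can be shown without loading any model.

```python
from typing import Any, Callable, Optional

class TimestampAwareRouter:
    """Hypothetical dispatcher: keep the FlashAttention2 pipeline for normal
    requests, and lazily build a second sdpa-backed instance only when the
    first word-level timestamp request arrives."""

    def __init__(self, load_flash: Callable[[], Callable], load_sdpa: Callable[[], Callable]):
        self._flash = load_flash()              # primary instance, always loaded
        self._load_sdpa = load_sdpa
        self._sdpa: Optional[Callable] = None   # built on first word-level request

    def __call__(self, audio: Any, return_timestamps: Optional[str] = None):
        if return_timestamps == "word":
            if self._sdpa is None:
                self._sdpa = self._load_sdpa()  # one-time cost, then cached
            return self._sdpa(audio, return_timestamps="word")
        return self._flash(audio, return_timestamps=return_timestamps)
```

The obvious cost is memory: once the sdpa instance is built, both models are resident on the GPU at the same time, so this only works where the card has headroom for two copies of the weights.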
