Whisper dynamic time warping #2728

Open · wants to merge 8 commits into main
Conversation


@nicksenger nicksenger commented Jan 18, 2025

Hi again, I thought it'd be useful to have these token/word-level timestamps available from the whisper implementation here.

The first commit adds an option (--cache) to use the kv cache on the decoder self-attention, so that after the initial pass only the final token needs to be processed. My main goal isn't really to improve performance, but I couldn't get the timestamps working without this change. The speed gains on the CPU appear significant (about 4x in my case), though I haven't done any true benchmarking, and any gains on the GPU or with the quantized models are less noticeable.
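Roughly, the caching idea can be sketched as follows. This is a hypothetical standalone illustration (the names, shapes, and `Vec<f32>` representation are not candle's actual API): keys and values for previously decoded tokens are kept around, so each new step only has to score the newest token's query against the accumulated cache instead of re-running attention over the full sequence.

```rust
// Illustrative decoder self-attention KV cache (not candle's API).
struct KvCache {
    keys: Vec<Vec<f32>>,   // one cached key per previously decoded token
    values: Vec<Vec<f32>>, // one cached value per previously decoded token
}

impl KvCache {
    fn new() -> Self {
        Self { keys: Vec::new(), values: Vec::new() }
    }

    // Append the newest token's key/value and return how many
    // positions the attention now spans.
    fn append(&mut self, k: Vec<f32>, v: Vec<f32>) -> usize {
        self.keys.push(k);
        self.values.push(v);
        self.keys.len()
    }

    // Score the newest token's query against every cached key
    // (unscaled dot products, just to show the shape of the work:
    // one query row instead of the whole sequence).
    fn attend(&self, query: &[f32]) -> Vec<f32> {
        self.keys
            .iter()
            .map(|k| k.iter().zip(query).map(|(a, b)| a * b).sum())
            .collect()
    }
}
```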

For the timestamps I mainly followed OpenAI's Python implementation, and given the same inputs the timestamps should match closely.
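For reference, the alignment step at the heart of this can be sketched as a plain dynamic-time-warping pass over a token-by-frame cost matrix. This is a hypothetical standalone sketch, not the code in this PR (the real implementation derives the cost matrix from the cross-attention weights of the model's alignment heads): a DP table accumulates the cheapest monotonic path, and a backtrace that prefers the diagonal recovers which audio frame each token aligns to.

```rust
// Minimal DTW over a `cost[token][frame]` matrix, returning the
// monotonic (token, frame) alignment path with minimal total cost.
fn dtw_path(cost: &[Vec<f32>]) -> Vec<(usize, usize)> {
    let n = cost.len();
    let m = cost[0].len();
    // dp[i][j]: cheapest accumulated cost aligning the first i tokens
    // with the first j frames. Row/column 0 act as boundary cells.
    let mut dp = vec![vec![f32::INFINITY; m + 1]; n + 1];
    dp[0][0] = 0.0;
    for i in 1..=n {
        for j in 1..=m {
            let best = dp[i - 1][j - 1].min(dp[i - 1][j]).min(dp[i][j - 1]);
            dp[i][j] = cost[i - 1][j - 1] + best;
        }
    }
    // Backtrace from the corner, preferring the diagonal move so ties
    // keep tokens and frames advancing together.
    let (mut i, mut j) = (n, m);
    let mut path = Vec::new();
    while i > 0 && j > 0 {
        path.push((i - 1, j - 1));
        let (diag, up, left) = (dp[i - 1][j - 1], dp[i - 1][j], dp[i][j - 1]);
        if diag <= up && diag <= left {
            i -= 1;
            j -= 1;
        } else if up < left {
            i -= 1;
        } else {
            j -= 1;
        }
    }
    path.reverse();
    path
}
```

The frame index assigned to each token is then scaled by the hop length to get a time in seconds.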

With these changes, passing the --dtw-timestamps flag prints the word-level timestamps:

Word { text: "and", start: 0.0, end: 0.7 }
Word { text: "so", start: 0.7, end: 1.06 }
Word { text: "my", start: 1.06, end: 1.44 }
Word { text: "fellow", start: 1.44, end: 1.74 }
Word { text: "americans", start: 1.74, end: 2.34 }
Word { text: "ask", start: 2.34, end: 3.86 }
Word { text: "not", start: 3.86, end: 4.54 }
Word { text: "what", start: 4.54, end: 5.68 }
Word { text: "your", start: 5.68, end: 6.02 }
Word { text: "country", start: 6.02, end: 6.34 }
Word { text: "can", start: 6.34, end: 6.76 }
Word { text: "do", start: 6.76, end: 7.0 }
Word { text: "for", start: 7.0, end: 7.24 }
Word { text: "you", start: 7.24, end: 8.06 }
Word { text: "ask", start: 8.06, end: 8.64 }
Word { text: "what", start: 8.64, end: 8.94 }
Word { text: "you", start: 8.94, end: 9.26 }
Word { text: "can", start: 9.26, end: 9.48 }
Word { text: "do", start: 9.48, end: 9.72 }
Word { text: "for", start: 9.72, end: 9.9 }
Word { text: "your", start: 9.9, end: 10.22 }
Word { text: "country", start: 10.22, end: 10.54 }

Time permitting, here are some things I'd still like to do, in no particular order:

  • Verify that this doesn't break any existing whisper functionality
  • Verify that this produces output consistent with the referenced implementation (9109c02 gets raw output for the JFK sample to sub-millisecond consistency)
  • Add alignment heads for the distilled models (57346e1 adds small en and large v3)
  • Improve the post-processing logic (15e0b1d adds some merging and part-of-speech handling, still pretty basic)
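The token-to-word merging mentioned in the last item can be sketched like this. This is a hypothetical helper, not the PR's actual post-processing; it relies on the fact that Whisper's BPE tokens begin with a space when they start a new word, so a token without a leading space is glued onto the previous word (and its timestamp range would be merged accordingly):

```rust
// Illustrative merge of sub-word tokens into words based on Whisper's
// leading-space convention (not the PR's actual post-processing).
fn merge_tokens(tokens: &[&str]) -> Vec<String> {
    let mut words: Vec<String> = Vec::new();
    for t in tokens {
        if t.starts_with(' ') || words.is_empty() {
            // A leading space marks the start of a new word.
            words.push(t.trim_start().to_string());
        } else {
            // Otherwise the token continues the previous word.
            words.last_mut().unwrap().push_str(t);
        }
    }
    words
}
```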

Any feedback is appreciated 👍

@nicksenger nicksenger marked this pull request as draft January 18, 2025 21:23
@nicksenger (Contributor, Author):

Taking this out of draft, as I'm not seeing any issues with the cache usage or timestamps. In some training experiments, enabling this caching appears to produce the same results, with a very slight speed increase.

I put some comments on the inference loop changes, which mostly just get the timestamps as close as possible to those produced by the OpenAI code. Transformers, whisper.cpp, etc. all appear to produce different timestamps, and I'm not sure which is most accurate; I just wanted to match one as a way of verifying the implementation.

@nicksenger nicksenger marked this pull request as ready for review January 22, 2025 08:21
@nicksenger nicksenger changed the title [Draft]: Whisper dynamic time warping Whisper dynamic time warping Jan 22, 2025