Whisper dynamic time warping #2728

Open · wants to merge 8 commits into main
Conversation


@nicksenger nicksenger commented Jan 18, 2025

Hi again, I thought it'd be useful to have these token/word-level timestamps available from the whisper implementation here.

The first commit adds an option (--cache) to use the kv cache on the decoder self-attention, so that after the initial pass only the final token needs to be processed. My main goal isn't really to improve performance, but I couldn't get the timestamps working without this change. The speed gains on the CPU appear significant (about 4x in my case), though I haven't done any true benchmarking, and any gains on the GPU or with the quantized models are less noticeable.
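Roughly, the caching idea can be sketched as follows. This is a hypothetical standalone illustration (the names, shapes, and `Vec<f32>` representation are not candle's actual API): keys and values for previously decoded tokens are kept around, so each new step only has to score the newest token's query against the accumulated cache instead of re-running attention over the full sequence.

```rust
// Illustrative decoder self-attention KV cache (not candle's API).
struct KvCache {
    keys: Vec<Vec<f32>>,   // one cached key per previously decoded token
    values: Vec<Vec<f32>>, // one cached value per previously decoded token
}

impl KvCache {
    fn new() -> Self {
        Self { keys: Vec::new(), values: Vec::new() }
    }

    // Append the newest token's key/value and return how many
    // positions the attention now spans.
    fn append(&mut self, k: Vec<f32>, v: Vec<f32>) -> usize {
        self.keys.push(k);
        self.values.push(v);
        self.keys.len()
    }

    // Score the newest token's query against every cached key
    // (unscaled dot products, just to show the shape of the work:
    // one query row instead of the whole sequence).
    fn attend(&self, query: &[f32]) -> Vec<f32> {
        self.keys
            .iter()
            .map(|k| k.iter().zip(query).map(|(a, b)| a * b).sum())
            .collect()
    }
}
```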

For the timestamps I mainly followed OpenAI's Python implementation, and given the same inputs the timestamps should match closely.
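For reference, the alignment step at the heart of this can be sketched as a plain dynamic-time-warping pass over a token-by-frame cost matrix. This is a hypothetical standalone sketch, not the code in this PR (the real implementation derives the cost matrix from the cross-attention weights of the model's alignment heads): a DP table accumulates the cheapest monotonic path, and a backtrace that prefers the diagonal recovers which audio frame each token aligns to.

```rust
// Minimal DTW over a `cost[token][frame]` matrix, returning the
// monotonic (token, frame) alignment path with minimal total cost.
fn dtw_path(cost: &[Vec<f32>]) -> Vec<(usize, usize)> {
    let n = cost.len();
    let m = cost[0].len();
    // dp[i][j]: cheapest accumulated cost aligning the first i tokens
    // with the first j frames. Row/column 0 act as boundary cells.
    let mut dp = vec![vec![f32::INFINITY; m + 1]; n + 1];
    dp[0][0] = 0.0;
    for i in 1..=n {
        for j in 1..=m {
            let best = dp[i - 1][j - 1].min(dp[i - 1][j]).min(dp[i][j - 1]);
            dp[i][j] = cost[i - 1][j - 1] + best;
        }
    }
    // Backtrace from the corner, preferring the diagonal move so ties
    // keep tokens and frames advancing together.
    let (mut i, mut j) = (n, m);
    let mut path = Vec::new();
    while i > 0 && j > 0 {
        path.push((i - 1, j - 1));
        let (diag, up, left) = (dp[i - 1][j - 1], dp[i - 1][j], dp[i][j - 1]);
        if diag <= up && diag <= left {
            i -= 1;
            j -= 1;
        } else if up < left {
            i -= 1;
        } else {
            j -= 1;
        }
    }
    path.reverse();
    path
}
```

The frame index assigned to each token is then scaled by the hop length to get a time in seconds.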

With these changes, passing the --dtw-timestamps flag prints the word-level timestamps:

Word { text: "and", start: 0.0, end: 0.7 }
Word { text: "so", start: 0.7, end: 1.06 }
Word { text: "my", start: 1.06, end: 1.44 }
Word { text: "fellow", start: 1.44, end: 1.74 }
Word { text: "americans", start: 1.74, end: 2.34 }
Word { text: "ask", start: 2.34, end: 3.86 }
Word { text: "not", start: 3.86, end: 4.54 }
Word { text: "what", start: 4.54, end: 5.68 }
Word { text: "your", start: 5.68, end: 6.02 }
Word { text: "country", start: 6.02, end: 6.34 }
Word { text: "can", start: 6.34, end: 6.76 }
Word { text: "do", start: 6.76, end: 7.0 }
Word { text: "for", start: 7.0, end: 7.24 }
Word { text: "you", start: 7.24, end: 8.06 }
Word { text: "ask", start: 8.06, end: 8.64 }
Word { text: "what", start: 8.64, end: 8.94 }
Word { text: "you", start: 8.94, end: 9.26 }
Word { text: "can", start: 9.26, end: 9.48 }
Word { text: "do", start: 9.48, end: 9.72 }
Word { text: "for", start: 9.72, end: 9.9 }
Word { text: "your", start: 9.9, end: 10.22 }
Word { text: "country", start: 10.22, end: 10.54 }

Time permitting, here are some things I'd still like to do, in no particular order:

  • Verify that this doesn't break any existing whisper functionality
  • Verify that this produces output consistent with the referenced implementation (9109c02 gets raw output for the JFK sample to sub-millisecond consistency)
  • Add alignment heads for the distilled models (57346e1 adds small en and large v3)
  • Improve the post-processing logic (15e0b1d adds some merging and part-of-speech handling, still pretty basic)
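The token-to-word merging mentioned in the last item can be sketched like this. This is a hypothetical helper, not the PR's actual post-processing; it relies on the fact that Whisper's BPE tokens begin with a space when they start a new word, so a token without a leading space is glued onto the previous word (and its timestamp range would be merged accordingly):

```rust
// Illustrative merge of sub-word tokens into words based on Whisper's
// leading-space convention (not the PR's actual post-processing).
fn merge_tokens(tokens: &[&str]) -> Vec<String> {
    let mut words: Vec<String> = Vec::new();
    for t in tokens {
        if t.starts_with(' ') || words.is_empty() {
            // A leading space marks the start of a new word.
            words.push(t.trim_start().to_string());
        } else {
            // Otherwise the token continues the previous word.
            words.last_mut().unwrap().push_str(t);
        }
    }
    words
}
```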

Any feedback is appreciated 👍

@nicksenger nicksenger marked this pull request as draft January 18, 2025 21:23
@nicksenger (Contributor, Author):

Taking this out of draft, as I'm not seeing any issues with the cache usage or timestamps. In some training experiments, enabling this caching appears to produce the same results, with a very slight speed increase.

I put some comments on the inference loop changes, which mostly just get the timestamps as close as possible to those produced by the OpenAI code. Transformers, whisper.cpp, etc. all appear to produce different timestamps, and I'm not sure which is most accurate; I just wanted to match one as a way of verifying the implementation.

@nicksenger nicksenger marked this pull request as ready for review January 22, 2025 08:21
@nicksenger nicksenger changed the title [Draft]: Whisper dynamic time warping Whisper dynamic time warping Jan 22, 2025