[Feature Request] Medusa support #2319
Comments
+1
I’m working on this and will hopefully upstream it next week. We have EAGLE speculative decoding now, but without tree decoding. To support tree decoding, we also need kernel support.
Hi @vinx13, will the tree decoding kernel be released next week, or will it take longer?
Glad to hear that, @vinx13, and thanks a bunch for your quick reply! Looking forward to seeing your pull request. Are you also working on the tree-based attention?
Initial support for Medusa was added in #2337; tree decoding is not yet supported, as more work is required.
Thanks a lot! We'll try Medusa list decoding first.
🚀 Feature
Please add Medusa decoding to mlc-llm in C++; we urgently need it to speed up LLM decoding on mobile devices.
Reference: https://github.com/FasterDecoding/Medusa/tree/main
Medusa adds extra "heads" to LLMs to predict multiple future tokens simultaneously. When augmenting a model with Medusa, the original model stays untouched, and only the new heads are fine-tuned during training.
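For readers unfamiliar with the head structure, here is a minimal PyTorch sketch of what the Medusa heads look like, loosely following the FasterDecoding/Medusa repository linked above. The class names, the `num_heads` default, and the exact layer shapes are illustrative assumptions, not code from mlc-llm.

```python
import torch
import torch.nn as nn


class ResBlock(nn.Module):
    """Residual block used inside a Medusa head: x + SiLU(Linear(x))."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.act(self.linear(x))


class MedusaHeads(nn.Module):
    """Extra heads mapping the base model's last hidden state to logits for
    positions t+2, t+3, ... (the base lm_head still predicts t+1). Only these
    heads are trained; the base model is frozen."""

    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            [
                nn.Sequential(
                    ResBlock(hidden_size),
                    nn.Linear(hidden_size, vocab_size, bias=False),
                )
                for _ in range(num_heads)
            ]
        )

    def forward(self, last_hidden: torch.Tensor) -> list[torch.Tensor]:
        # last_hidden: [batch, hidden_size] hidden state at the current position.
        # Returns one logits tensor per Medusa head.
        return [head(last_hidden) for head in self.heads]
```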
Motivation
Medusa is an excellent solution for speeding up LLM decoding by 2.2x to 3.6x without affecting the original model's output quality.
It solves problems of existing speculative decoding approaches, such as the need for a good draft model, system complexity, and inefficiency with sampling-based generation.
TVM and MLC-LLM aim to deploy models everywhere, especially on mobile devices, which requires excellent memory management and cost-efficient inference under extremely limited resources. Implementing such a powerful speedup technique would therefore greatly enhance the impact and visibility of MLC-LLM.
Alternatives
Speculative decoding with a draft model such as EAGLE, but it requires careful training of the draft model to achieve good performance.
Additional context
We've tried to implement it in MLC-LLM but found it rather difficult to implement the tree-based attention and the KV cache update within MLC-LLM's current, fairly complicated code structure. We therefore turn to the community for help.
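To illustrate the tree-based attention part mentioned above, here is a minimal Python sketch (not the mlc-llm C++ code) of how an attention mask for a Medusa candidate tree could be built. The parent-index representation and the function name are assumptions for illustration only, not an existing MLC-LLM or Medusa API.

```python
import torch


def build_tree_attention_mask(parents: list[int]) -> torch.Tensor:
    """Build a tree attention mask for speculative candidates.

    `parents[i]` is the index of node i's parent within the candidate tree,
    or -1 if node i hangs directly off the already-verified prefix.
    Returns an [n, n] boolean mask where entry (i, j) is True iff candidate i
    may attend to candidate j, i.e. j is i itself or one of i's ancestors.
    (Attention to the verified prefix would additionally be allowed for all
    candidates; that part is omitted here.)
    """
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:          # walk up to the root, marking ancestors
            mask[i, j] = True
            j = parents[j]
    return mask


# Example tree: nodes 0 and 1 are children of the verified prefix,
# nodes 2 and 3 are children of node 0, node 4 is a child of node 1.
print(build_tree_attention_mask([-1, -1, 0, 0, 1]).int())
```

The KV cache update then amounts to keeping only the cache entries along the accepted path of this tree and discarding the rest, which is the part we found hard to express with the current code structure.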