
[Feature Request] Medusa support #2319

Open
EmilioZhao opened this issue May 10, 2024 · 6 comments
Labels
feature request New feature or request

Comments

EmilioZhao commented May 10, 2024

🚀 Feature

Please add Medusa decoding to mlc-llm in C++; we urgently need it to speed up LLM decoding on mobile devices.
Reference: https://github.com/FasterDecoding/Medusa/tree/main
Medusa adds extra "heads" to LLMs to predict multiple future tokens simultaneously. When augmenting a model with Medusa, the original model stays untouched, and only the new heads are fine-tuned during training.
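As a toy illustration of the idea described above (not code from the thread or from Medusa itself; all names and sizes are hypothetical): the frozen base LM head predicts the next token, while each extra Medusa head predicts one additional future position from the same last hidden state, yielding several draft tokens per decoding step.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN, VOCAB, NUM_HEADS = 16, 32, 3  # toy sizes, not Medusa's real config

# Frozen base LM head plus NUM_HEADS extra "Medusa heads"; in real
# training only the extra heads would be fine-tuned.
base_head = rng.standard_normal((HIDDEN, VOCAB))
medusa_heads = [rng.standard_normal((HIDDEN, VOCAB)) for _ in range(NUM_HEADS)]

def propose_tokens(last_hidden):
    """Greedy proposal: the base head drafts token t+1, and head k
    drafts token t+1+k, all from the same hidden state."""
    logits = [last_hidden @ base_head] + [last_hidden @ h for h in medusa_heads]
    return [int(np.argmax(l)) for l in logits]

hidden = rng.standard_normal(HIDDEN)
draft = propose_tokens(hidden)
print(len(draft))  # one base prediction plus NUM_HEADS extra drafts
```

The drafted tokens are then verified by the base model in a single forward pass, which is where the decoding speedup comes from.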

Motivation

Medusa is an excellent solution for speeding up LLM decoding by 2.2–3.6x without affecting the original model's output quality.
It addresses the problems of current speculative decoding, such as the need for a good draft model, system complexity, and inefficiency when using sampling-based generation.

TVM and MLC-LLM aim to deploy models everywhere, especially on mobile devices, which demands excellent memory management and cost-efficient inference under extremely limited resources. Implementing such a speedup monster would therefore greatly enhance the impact and visibility of MLC-LLM.

Alternatives

Speculative decoding with a draft model, such as Eagle; but that requires careful training of the draft model to perform well.

Additional context

We've tried to implement it in MLC-LLM but found it rather difficult to implement the "Tree-based Attention" and the KV-cache update with MLC-LLM's current, complicated code structure. Therefore, we turn to the community.
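For readers unfamiliar with the "Tree-based Attention" mentioned above, a minimal sketch of the mask it requires (a hypothetical toy, not MLC-LLM or Medusa code): the candidate continuations from the heads form a tree, and each draft token may attend only to itself and its ancestors, so that all branches can be verified in one batched forward pass.

```python
import numpy as np

# Hypothetical candidate tree over 5 draft tokens: parents[i] is the
# index of node i's parent, and -1 marks a child of the root
# (the last verified token).
parents = [-1, 0, 0, 1, 2]

def tree_attention_mask(parents):
    """mask[i, j] is True iff draft token i may attend to draft token j,
    i.e. j is i itself or one of i's ancestors in the candidate tree."""
    n = len(parents)
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        j = i
        while j != -1:  # walk up to the root, enabling each ancestor
            mask[i, j] = True
            j = parents[j]
    return mask

mask = tree_attention_mask(parents)
print(mask.astype(int))
```

The hard part the comment alludes to is not the mask itself but wiring it through the attention kernels and updating the KV cache when only one branch of the tree is accepted.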

@EmilioZhao EmilioZhao added the feature request New feature or request label May 10, 2024
jpf888 commented May 10, 2024

+1

vinx13 (Member) commented May 10, 2024

I'm working on this and will hopefully upstream it next week. We have Eagle speculative decoding now, without tree decoding. To support tree decoding, we also need the kernel support.

jpf888 commented May 11, 2024

> I'm working on this and will hopefully upstream it next week. We have Eagle speculative decoding now, without tree decoding. To support tree decoding, we also need the kernel support.

hi @vinx13

Will the tree decoding kernel be released next week, or will it take longer?

@EmilioZhao (Author)

> I'm working on this and will hopefully upstream it next week. We have Eagle speculative decoding now, without tree decoding. To support tree decoding, we also need the kernel support.

Glad to hear that, @vinx13, and thanks a bunch for the quick reply! Looking forward to your pull request. Are you also working on the tree-based attention?

vinx13 (Member) commented May 14, 2024

Initial support for Medusa was added in #2337; tree decoding is not yet supported, as more work is required.

@EmilioZhao (Author)

Thanks a lot! We'll try Medusa list decoding first.
