[Feature Request] Medusa support #2319
Comments
+1
I’m working on this and will hopefully upstream it next week. We have EAGLE speculative decoding now, but without tree decoding. To support tree decoding, we also need kernel support.
Hi @vinx13, will the tree decoding kernel be released next week, or will it take longer?
Glad to hear that, @vinx13, and thanks a bunch for your quick reply! Looking forward to seeing your pull request. Are you also working on the tree-based attention?
Initial support for Medusa was added in #2337; tree decoding is not yet supported, as more work is required.
Thanks a lot! We'll try Medusa list decoding first.
🚀 Feature
Please add Medusa decoding to mlc-llm in C++; we urgently need it to speed up LLM decoding on mobile devices.
Reference: https://github.com/FasterDecoding/Medusa/tree/main
Medusa adds extra "heads" to LLMs to predict multiple future tokens simultaneously. When augmenting a model with Medusa, the original model stays untouched, and only the new heads are fine-tuned during training.
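For readers unfamiliar with the head structure, here is a minimal PyTorch sketch of what the Medusa heads look like, loosely following the FasterDecoding/Medusa repository linked above. The class names, the `num_heads` default, and the exact layer shapes are illustrative assumptions, not code from mlc-llm.

```python
import torch
import torch.nn as nn


class ResBlock(nn.Module):
    """Residual block used inside a Medusa head: x + SiLU(Linear(x))."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, hidden_size)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.act(self.linear(x))


class MedusaHeads(nn.Module):
    """Extra heads mapping the base model's last hidden state to logits for
    positions t+2, t+3, ... (the base lm_head still predicts t+1). Only these
    heads are trained; the base model is frozen."""

    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            [
                nn.Sequential(
                    ResBlock(hidden_size),
                    nn.Linear(hidden_size, vocab_size, bias=False),
                )
                for _ in range(num_heads)
            ]
        )

    def forward(self, last_hidden: torch.Tensor) -> list[torch.Tensor]:
        # last_hidden: [batch, hidden_size] hidden state at the current position.
        # Returns one logits tensor per Medusa head.
        return [head(last_hidden) for head in self.heads]
```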
Motivation
Medusa is an excellent solution for speeding up LLM decoding by 2.2x to 3.6x without affecting the original model's output quality.
It solves problems of existing speculative decoding approaches, such as the need for a good draft model, system complexity, and inefficiency with sampling-based generation.
TVM and MLC-LLM aim to deploy models everywhere, especially on mobile devices, which requires excellent memory management and cost-efficient inference under extremely limited resources. Implementing such a powerful speedup technique would therefore greatly enhance the impact and visibility of MLC-LLM.
Alternatives
Speculative decoding with a draft model such as EAGLE, but it requires careful training of the draft model to achieve good performance.
Additional context
We've tried to implement it in MLC-LLM but found it rather difficult to implement the tree-based attention and the KV cache update within MLC-LLM's current, fairly complicated code structure. We therefore turn to the community for help.
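To illustrate the tree-based attention part mentioned above, here is a minimal Python sketch (not the mlc-llm C++ code) of how an attention mask for a Medusa candidate tree could be built. The parent-index representation and the function name are assumptions for illustration only, not an existing MLC-LLM or Medusa API.

```python
import torch


def build_tree_attention_mask(parents: list[int]) -> torch.Tensor:
    """Build a tree attention mask for speculative candidates.

    `parents[i]` is the index of node i's parent within the candidate tree,
    or -1 if node i hangs directly off the already-verified prefix.
    Returns an [n, n] boolean mask where entry (i, j) is True iff candidate i
    may attend to candidate j, i.e. j is i itself or one of i's ancestors.
    (Attention to the verified prefix would additionally be allowed for all
    candidates; that part is omitted here.)
    """
    n = len(parents)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        j = i
        while j != -1:          # walk up to the root, marking ancestors
            mask[i, j] = True
            j = parents[j]
    return mask


# Example tree: nodes 0 and 1 are children of the verified prefix,
# nodes 2 and 3 are children of node 0, node 4 is a child of node 1.
print(build_tree_attention_mask([-1, -1, 0, 0, 1]).int())
```

The KV cache update then amounts to keeping only the cache entries along the accepted path of this tree and discarding the rest, which is the part we found hard to express with the current code structure.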