Merge to upstream flash-attention repo #35
Comments
I too am hoping for a merge. Currently this repo is some versions behind the upstream flash-attention repo, which breaks compatibility with other open source projects that use flash-attention, for example exllamav2, as described below. It feels like we are just a step away from where ROCm could be used in a lot of projects, which would help owners of consumer GPUs. Flash-Attention is a big step towards that goal. I am grateful for all the contributors!
There will be a new version of Composable Kernel coming out shortly. We will redesign the FA based on that one, so I think we can request to merge into the upstream for that new version.
@howiejayz @sabreshao Are there any updates regarding bumping the version?
@howiejayz Is there going to be a new version upgrade? Recently I've been using an MI250X to train Gemma and encountered some difficulties (#51); I'm not sure if they are due to the version. I noticed that the upstream version mentions "All head dimensions up to 256. Head dim > 192 backward requires A100/A800 or H100/H800. Head dim 256 backward now works on consumer GPUs (if there's no dropout) as of flash-attn 2.5.5." The current branch version seems to be 2.0.4. Do you have any suggestions regarding this issue? Any help would be greatly appreciated.
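For reference, a fallback along these lines can keep Gemma-style models (head dim 256) running while the ROCm port is still on 2.0.4. This is only a minimal sketch assuming the upstream `flash_attn_func` API; the version threshold and shapes are illustrative, not taken from this repo:

```python
# Minimal sketch (not from this repo): route around head dims the installed
# flash-attn build cannot handle, falling back to PyTorch SDPA instead.
import torch
import flash_attn
from flash_attn import flash_attn_func  # expects (batch, seqlen, nheads, headdim)

def attention(q, k, v, causal=True, dropout_p=0.0):
    head_dim = q.shape[-1]
    major, minor = (int(x) for x in flash_attn.__version__.split(".")[:2])
    if head_dim <= 128 or (major, minor) >= (2, 5):
        return flash_attn_func(q, k, v, dropout_p=dropout_p, causal=causal)
    # Older builds (e.g. the 2.0.4 ROCm fork) lack head dim 256 support, which
    # Gemma needs, so fall back to scaled_dot_product_attention instead.
    out = torch.nn.functional.scaled_dot_product_attention(
        q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2),
        dropout_p=dropout_p, is_causal=causal,
    )
    return out.transpose(1, 2)
```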
Worth mentioning here, regarding exllamav2 and what it works with:
@nktice Agreed. In Transformers we had to hack something to keep compatibility with 2.0.4: https://github.com/huggingface/transformers/blob/1ab71364886010c31b20dd8c8bb0c60f8a0681ad/src/transformers/models/llama/modeling_llama.py#L418
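As I understand it, that workaround boils down to gating on the installed flash-attn version, since flash-attn < 2.1 aligns the causal mask to the top-left corner when query and key lengths differ. A rough sketch of that kind of gate (the helper name here is illustrative, not the actual Transformers function):

```python
# Rough sketch of a version gate (helper name is illustrative): flash-attn < 2.1
# aligns the causal mask to the top-left corner when q_len != kv_len, while
# >= 2.1 aligns it bottom-right, so callers must compensate on older builds
# such as the ROCm 2.0.4 fork.
import importlib.metadata
from packaging import version

def flash_attn_uses_top_left_mask() -> bool:
    installed = version.parse(importlib.metadata.version("flash-attn"))
    return installed < version.parse("2.1.0")
```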
We have already started development of the new flash 2.5.5 internally, and hd256 (head dim 256) support will be included. The ETA is still under discussion.
Will this be for all officially supported GPUs, so the RX 7900 XTX and 7900 XT included?
This is the important part from that issue:
AOTriton has been updated in PyTorch nightly to add support for Navi31, which enables Triton-based Flash Attention in PyTorch for Navi31 out of the box:
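For anyone who wants to try it, something along these lines should exercise that path through PyTorch's SDPA on a recent ROCm nightly. The environment variable is an assumption based on that discussion and may not be required on current builds:

```python
# Hedged sketch: on a ROCm nightly with AOTriton support for Navi31, the flash
# backend of PyTorch SDPA should be usable directly. The environment variable
# below is an assumption from the linked discussion, not something this repo defines.
import os
os.environ.setdefault("TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL", "1")

import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

q = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Restrict SDPA to the flash-attention backend so the AOTriton kernel is used.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)
```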
@evshiron, does this Triton-based Flash Attention support AMD MI60/gfx906 cards?
I don't have these graphics cards, but in a quick search I could not find any code mentioning
Hi @Said-Akbar, currently Triton-based FA is only supported on Navi.
I am requesting that you merge with the upstream flash-attention repo, in order to garner community engagement and improve integration and distribution.
This separation is a major blocker to AMD adoption for AI model training and inference use cases.
Dao-AILab#707