Merge to upstream flash-attention repo #35

Open
ehartford opened this issue Jan 18, 2024 · 14 comments

@ehartford

I am requesting that you merge with the upstream flash-attention repo, in order to garner community engagement and improve integration and distribution.

This separation is a major blocker to AMD adoption for AI model training and inference use cases.

Dao-AILab#707

@Taylantz

I too am hoping for a merge. Currently this repo is several versions behind the upstream flash-attention repo, which breaks compatibility with other open-source projects that use flash-attention, for example exllamav2, as described below:

oobabooga#3759

It feels like we are just a step away from ROCm being usable in a lot of projects that would help owners of consumer GPUs. Flash Attention is a big step towards that goal. I am grateful to all the contributors!

@dejay-vu

A new version of Composable Kernel is coming out shortly. We will redesign the FA implementation based on it, so I think we can request a merge into upstream for that new version.

@fxmarty

fxmarty commented Feb 29, 2024

@howiejayz @sabreshao Are there any updates regarding bumping the version?

@xxtars

xxtars commented Apr 1, 2024

@howiejayz Is there going to be a new version upgrade? Recently, I've been using an MI250X to train Gemma and encountered some difficulties #51. I'm not sure if it's due to the version. I noticed that the upstream version mentions "All head dimensions up to 256. Head dim > 192 backward requires A100/A800 or H100/H800. Head dim 256 backward now works on consumer GPUs (if there's no dropout) as of flash-attn 2.5.5." Currently, the branch version seems to be 2.0.4.

Do you have any suggestions regarding this issue? Any help would be greatly appreciated.
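A minimal sketch of one way to check whether the installed build is the likely culprit, assuming flash-attn and packaging are installed; the 2.5.5 threshold and head_dim value below simply mirror the release note quoted above and are illustrative only:

```python
# Minimal sketch: compare the installed flash-attn version against the
# upstream 2.5.5 head-dim note quoted above. Values are illustrative.
from packaging import version

import flash_attn

installed = version.parse(flash_attn.__version__)
print(f"flash-attn version: {installed}")

head_dim = 256  # illustrative value taken from the quoted release note
if installed < version.parse("2.5.5") and head_dim > 192:
    print("Backward for head_dim > 192 may be unsupported on this build; "
          "the old version, rather than the model, could explain the failures.")
```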

@nktice

nktice commented Apr 3, 2024

Worth mentioning here, regarding exllamav2 and what it works with: it turns out this version isn't really of much use to them. Here is their reply:
turboderp-org/exllamav2#397 (comment)

@fxmarty

fxmarty commented Apr 5, 2024

@nktice Agreed. In Transformers we had to add a workaround to keep compatibility with 2.0.4: https://github.com/huggingface/transformers/blob/1ab71364886010c31b20dd8c8bb0c60f8a0681ad/src/transformers/models/llama/modeling_llama.py#L418
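Roughly, the workaround boils down to a version gate like the sketch below; this is illustrative only, not the exact Transformers code (see the link above), and the helper name and flag are made up for the example. flash-attn builds before 2.1.0 align the causal mask to the top-left corner when q_len != k_len, so the causal flag has to be handled differently on old builds:

```python
# Rough sketch of a version-gated workaround (illustrative, not the exact
# Transformers code; see the linked line for the real implementation).
from packaging import version

import flash_attn

# flash-attn releases before 2.1.0 align the causal mask to the top-left
# corner when q_len != k_len, which breaks cached (incremental) decoding.
FLASH_ATTN_USES_TOP_LEFT_MASK = version.parse(flash_attn.__version__) < version.parse("2.1.0")

def resolve_causal_flag(is_causal: bool, query_length: int) -> bool:
    """Decide what to pass as `causal` to flash_attn_func for this build."""
    if not FLASH_ATTN_USES_TOP_LEFT_MASK:
        return is_causal
    # On older builds, drop the causal flag for single-token (q_len == 1)
    # decoding steps to avoid the misaligned mask.
    return is_causal and query_length != 1
```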

@sabreshao
Collaborator

> @howiejayz Is there going to be a new version upgrade? Recently, I've been using an MI250X to train Gemma and encountered some difficulties #51. I'm not sure if it's due to the version. I noticed that the upstream version mentions "All head dimensions up to 256. Head dim > 192 backward requires A100/A800 or H100/H800. Head dim 256 backward now works on consumer GPUs (if there's no dropout) as of flash-attn 2.5.5." Currently, the branch version seems to be 2.0.4.
>
> Do you have any suggestions regarding this issue? Any help would be greatly appreciated.

We have already started development of the new flash 2.5.5 internally and hd256 support will be included. ETA is still under discussion.

@Kademo15

Kademo15 commented Apr 10, 2024

> @howiejayz Is there going to be a new version upgrade? Recently, I've been using an MI250X to train Gemma and encountered some difficulties #51. I'm not sure if it's due to the version. I noticed that the upstream version mentions "All head dimensions up to 256. Head dim > 192 backward requires A100/A800 or H100/H800. Head dim 256 backward now works on consumer GPUs (if there's no dropout) as of flash-attn 2.5.5." Currently, the branch version seems to be 2.0.4.
> Do you have any suggestions regarding this issue? Any help would be greatly appreciated.
>
> We have already started development of the new flash 2.5.5 internally and hd256 support will be included. ETA is still under discussion.

Will this be for all officially supported GPUs, so the RX 7900 XTX and 7900 XT included?

@gardner

gardner commented May 21, 2024

> Worth mentioning here, regarding exllamav2 and what it works with: it turns out this version isn't really of much use to them. Here is their reply: turboderp/exllamav2#397 (comment)

This is the important part from that issue:

> flash-attn introduced a crucial change in 2.1.0 without which it's really kind of useless for generating text. Before this it only worked with k_len = q_len or q_len = 1, ruling out features like cache reuse, speculative decoding and chunked prefill. ExLlama used to have some workarounds, but they were problematic and mostly just ended up disabling flash-attn anyway.
>
> So I would say supporting 2.0.4 is hard. 2.1.0 should be possible (although it checks for version 2.2.1 at the moment).
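To make the k_len != q_len point concrete, here is a minimal sketch, assuming flash-attn >= 2.1.0 and a supported GPU, of the cached-decoding call shape that older builds could not handle with a causal mask (the shapes are illustrative):

```python
# Minimal sketch (assumes flash-attn >= 2.1.0 and a supported GPU): a cached
# decoding step where the query covers only the new tokens (q_len < k_len).
import torch
from flash_attn import flash_attn_func

batch, n_heads, head_dim = 1, 32, 128
k_len = 512   # tokens already in the KV cache plus the new ones
q_len = 16    # only the newly appended tokens are queried

q = torch.randn(batch, q_len, n_heads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn(batch, k_len, n_heads, head_dim, device="cuda", dtype=torch.float16)
v = torch.randn(batch, k_len, n_heads, head_dim, device="cuda", dtype=torch.float16)

# Since 2.1.0 the causal mask is aligned to the bottom-right corner, so each
# query token attends to the whole cache plus itself, which is what cache
# reuse, speculative decoding and chunked prefill need.
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # (batch, q_len, n_heads, head_dim)
```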

@mirh

mirh commented Sep 14, 2024

Progressing

@evshiron

evshiron commented Sep 15, 2024

AOTriton has been updated in PyTorch nightly to add support for Navi31, which enables Triton-based Flash Attention in PyTorch for Navi31 out of the box.
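For reference, a minimal sketch of using that path, assuming a recent PyTorch nightly with the AOTriton-backed flash backend and a Navi31 card (the shapes are illustrative):

```python
# Minimal sketch (assumes a recent PyTorch nightly with AOTriton support and
# a Navi31 GPU): use PyTorch's built-in scaled_dot_product_attention with the
# flash backend instead of the flash-attn package.
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

q = torch.randn(1, 32, 1024, 128, device="cuda", dtype=torch.float16)
k = torch.randn(1, 32, 1024, 128, device="cuda", dtype=torch.float16)
v = torch.randn(1, 32, 1024, 128, device="cuda", dtype=torch.float16)

# Restrict SDPA to the flash backend so it fails loudly if flash attention
# is not actually available on this build/GPU.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape)  # (1, 32, 1024, 128)
```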

@Said-Akbar

@evshiron, does this Triton-based Flash Attention support AMD MI60/gfx906 cards?

@evshiron

evshiron commented Nov 9, 2024

@Said-Akbar

I don't have these graphics cards, but a quick search did not turn up any code mentioning gfx906 in the Triton repo, so my answer would be no.

@huanrwan-amd

> @evshiron, does this Triton-based Flash Attention support AMD MI60/gfx906 cards?

Hi @Said-Akbar, currently Triton-based FA is only supported on Navi.
