Merge to upstream flash-attention repo #35

Open
ehartford opened this issue Jan 18, 2024 · 14 comments

@ehartford

I am requesting that you merge with the upstream flash-attention repo, in order to garner community engagement and improve integration and distribution.

This separation is a major blocker to AMD adoption for AI model training and inference use cases.

Dao-AILab#707

@Taylantz

I too am hoping for a merge. Currently this repo is several versions behind the upstream flash-attention repo, which breaks compatibility with other open-source projects that use flash-attention, for example exllamav2, as described below:

oobabooga#3759

It feels like we are just a step away from ROCm being usable in a lot of projects that would help owners of consumer GPUs. Flash Attention is a big step towards that goal. I am grateful to all the contributors!

@dejay-vu

A new version of Composable Kernel is coming out shortly. We will redesign the FA implementation based on it, so I think we can request a merge into upstream for that new version.

@fxmarty

fxmarty commented Feb 29, 2024

@howiejayz @sabreshao Are there any updates regarding bumping the version?

@xxtars

xxtars commented Apr 1, 2024

@howiejayz Is there going to be a new version upgrade? Recently, I've been using an MI250X to train Gemma and encountered some difficulties #51. I'm not sure if it's due to the version. I noticed that the upstream version mentions "All head dimensions up to 256. Head dim > 192 backward requires A100/A800 or H100/H800. Head dim 256 backward now works on consumer GPUs (if there's no dropout) as of flash-attn 2.5.5." Currently, the branch version seems to be 2.0.4.

Do you have any suggestions regarding this issue? Any help would be greatly appreciated.
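A minimal sketch of one way to check whether the installed build is the likely culprit, assuming flash-attn and packaging are installed; the 2.5.5 threshold and head_dim value below simply mirror the release note quoted above and are illustrative only:

```python
# Minimal sketch: compare the installed flash-attn version against the
# upstream 2.5.5 head-dim note quoted above. Values are illustrative.
from packaging import version

import flash_attn

installed = version.parse(flash_attn.__version__)
print(f"flash-attn version: {installed}")

head_dim = 256  # illustrative value taken from the quoted release note
if installed < version.parse("2.5.5") and head_dim > 192:
    print("Backward for head_dim > 192 may be unsupported on this build; "
          "the old version, rather than the model, could explain the failures.")
```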

@nktice

nktice commented Apr 3, 2024

Worth mentioning here, regarding exllamav2 and what it works with: it turns out this version isn't really of much use to them. Here is their reply:
turboderp-org/exllamav2#397 (comment)

@fxmarty

fxmarty commented Apr 5, 2024

@nktice Agreed. In Transformers we had to add a workaround to keep compatibility with 2.0.4: https://github.com/huggingface/transformers/blob/1ab71364886010c31b20dd8c8bb0c60f8a0681ad/src/transformers/models/llama/modeling_llama.py#L418
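Roughly, the workaround boils down to a version gate like the sketch below; this is illustrative only, not the exact Transformers code (see the link above), and the helper name and flag are made up for the example. flash-attn builds before 2.1.0 align the causal mask to the top-left corner when q_len != k_len, so the causal flag has to be handled differently on old builds:

```python
# Rough sketch of a version-gated workaround (illustrative, not the exact
# Transformers code; see the linked line for the real implementation).
from packaging import version

import flash_attn

# flash-attn releases before 2.1.0 align the causal mask to the top-left
# corner when q_len != k_len, which breaks cached (incremental) decoding.
FLASH_ATTN_USES_TOP_LEFT_MASK = version.parse(flash_attn.__version__) < version.parse("2.1.0")

def resolve_causal_flag(is_causal: bool, query_length: int) -> bool:
    """Decide what to pass as `causal` to flash_attn_func for this build."""
    if not FLASH_ATTN_USES_TOP_LEFT_MASK:
        return is_causal
    # On older builds, drop the causal flag for single-token (q_len == 1)
    # decoding steps to avoid the misaligned mask.
    return is_causal and query_length != 1
```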

@sabreshao
Collaborator

> @howiejayz Is there going to be a new version upgrade? Recently, I've been using an MI250X to train Gemma and encountered some difficulties #51. I'm not sure if it's due to the version. I noticed that the upstream version mentions "All head dimensions up to 256. Head dim > 192 backward requires A100/A800 or H100/H800. Head dim 256 backward now works on consumer GPUs (if there's no dropout) as of flash-attn 2.5.5." Currently, the branch version seems to be 2.0.4.
>
> Do you have any suggestions regarding this issue? Any help would be greatly appreciated.

We have already started development of the new flash 2.5.5 internally and hd256 support will be included. ETA is still under discussion.

@Kademo15

Kademo15 commented Apr 10, 2024

> @howiejayz Is there going to be a new version upgrade? Recently, I've been using an MI250X to train Gemma and encountered some difficulties #51. I'm not sure if it's due to the version. I noticed that the upstream version mentions "All head dimensions up to 256. Head dim > 192 backward requires A100/A800 or H100/H800. Head dim 256 backward now works on consumer GPUs (if there's no dropout) as of flash-attn 2.5.5." Currently, the branch version seems to be 2.0.4.
> Do you have any suggestions regarding this issue? Any help would be greatly appreciated.
>
> We have already started development of the new flash 2.5.5 internally and hd256 support will be included. ETA is still under discussion.

Will this be for all officially supported GPUs, so the RX 7900 XTX and 7900 XT included?

@gardner

gardner commented May 21, 2024

> Worth mentioning here, regarding exllamav2 and what it works with: it turns out this version isn't really of much use to them. Here is their reply: turboderp/exllamav2#397 (comment)

This is the important part from that issue:

> flash-attn introduced a crucial change in 2.1.0 without which it's really kind of useless for generating text. Before this it only worked with k_len = q_len or q_len = 1, ruling out features like cache reuse, speculative decoding and chunked prefill. ExLlama used to have some workarounds, but they were problematic and mostly just ended up disabling flash-attn anyway.
>
> So I would say supporting 2.0.4 is hard. 2.1.0 should be possible (although it checks for version 2.2.1 at the moment).
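To make the k_len != q_len point concrete, here is a minimal sketch, assuming flash-attn >= 2.1.0 and a supported GPU, of the cached-decoding call shape that older builds could not handle with a causal mask (the shapes are illustrative):

```python
# Minimal sketch (assumes flash-attn >= 2.1.0 and a supported GPU): a cached
# decoding step where the query covers only the new tokens (q_len < k_len).
import torch
from flash_attn import flash_attn_func

batch, n_heads, head_dim = 1, 32, 128
k_len = 512   # tokens already in the KV cache plus the new ones
q_len = 16    # only the newly appended tokens are queried

q = torch.randn(batch, q_len, n_heads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn(batch, k_len, n_heads, head_dim, device="cuda", dtype=torch.float16)
v = torch.randn(batch, k_len, n_heads, head_dim, device="cuda", dtype=torch.float16)

# Since 2.1.0 the causal mask is aligned to the bottom-right corner, so each
# query token attends to the whole cache plus itself, which is what cache
# reuse, speculative decoding and chunked prefill need.
out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # (batch, q_len, n_heads, head_dim)
```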

@mirh

mirh commented Sep 14, 2024

Progressing

@evshiron

evshiron commented Sep 15, 2024

AOTriton has been updated in PyTorch nightly to add support for Navi31, which enables Triton-based Flash Attention in PyTorch for Navi31 out of the box.
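For reference, a minimal sketch of using that path, assuming a recent PyTorch nightly with the AOTriton-backed flash backend and a Navi31 card (the shapes are illustrative):

```python
# Minimal sketch (assumes a recent PyTorch nightly with AOTriton support and
# a Navi31 GPU): use PyTorch's built-in scaled_dot_product_attention with the
# flash backend instead of the flash-attn package.
import torch
from torch.nn.attention import SDPBackend, sdpa_kernel

q = torch.randn(1, 32, 1024, 128, device="cuda", dtype=torch.float16)
k = torch.randn(1, 32, 1024, 128, device="cuda", dtype=torch.float16)
v = torch.randn(1, 32, 1024, 128, device="cuda", dtype=torch.float16)

# Restrict SDPA to the flash backend so it fails loudly if flash attention
# is not actually available on this build/GPU.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)

print(out.shape)  # (1, 32, 1024, 128)
```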

@Said-Akbar

@evshiron, does this Triton-based Flash Attention support AMD MI60/gfx906 cards?

@evshiron

evshiron commented Nov 9, 2024

@Said-Akbar

I don't have these graphics cards, but a quick search did not turn up any code mentioning gfx906 in the Triton repo, so my answer would be no.

@huanrwan-amd

> @evshiron, does this Triton-based Flash Attention support AMD MI60/gfx906 cards?

Hi @Said-Akbar, currently Triton-based FA is only supported on Navi.
