Any plans about flash attention v2? #25

Open · jackbravo opened this issue Sep 25, 2024 · 7 comments

@jackbravo
Any plans on upgrading this repo for v2 of flash-attention?

@Nidal890

> Any plans on upgrading this repo for v2 of flash-attention?

Are you referring to FlashAttention v3? This repo was just upgraded to v2 of the FlashAttention algorithm over the summer. I think the author said he's looking into FA v3, PyTorch's FlexAttention, or something along those lines.

@jackbravo (Author)

No, v2. Sorry, from the last released version I saw (v1.0.1), the commit history, and searching the repo, I couldn't find any reference or pointer as to whether this fork supports v2 of flash-attention. That's why I asked.

@jackbravo (Author)

And I think I'm just shooting in the dark. Sorry, I thought I could use this repo as a replacement for the Python flash-attention project/package on macOS. But seeing that this is a Swift implementation of the algorithm, I don't think that is possible.

I was following the README at https://github.com/QwenLM/Qwen2-VL, which mentions that you can use flash_attention_2 to speed up inference, but the Python project seems to run only on CUDA.

@philipturner (Owner)

FlashAttention v3 was an algorithm specialized for the H100 chip. It doesn't support the backward pass or other hardware. You could argue that the metal-flash-attention repo is an alternative "3rd version", specialized for Apple hardware instead of Nvidia hardware. It improves on FlashAttention v2 by fixing some parallelization/complexity bottlenecks.

> But seeing that this is a Swift implementation of the algorithm, I don't think that is possible.

You can just translate the code to your desired language. That's been done before; I've had someone translate both the GEMM and forward FlashAttention code to C++.
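
To give a rough idea of what any host-language translation has to reproduce, here is a minimal Swift sketch of the host-side dispatch for a forward attention kernel: bind the Q/K/V/O buffers and launch one threadgroup per block of query rows. The function name, buffer order, and threadgroup sizes are placeholders for illustration, not the repository's actual symbols.

```swift
import Metal

// Hypothetical host-side dispatch of a compiled forward-attention kernel.
// The pipeline is assumed to have been built from the generated MSL source.
func dispatchAttentionForward(
  device: MTLDevice,
  pipeline: MTLComputePipelineState,
  q: MTLBuffer, k: MTLBuffer, v: MTLBuffer, o: MTLBuffer,
  rowBlocks: Int
) {
  let queue = device.makeCommandQueue()!
  let commandBuffer = queue.makeCommandBuffer()!
  let encoder = commandBuffer.makeComputeCommandEncoder()!

  encoder.setComputePipelineState(pipeline)
  encoder.setBuffer(q, offset: 0, index: 0)
  encoder.setBuffer(k, offset: 0, index: 1)
  encoder.setBuffer(v, offset: 0, index: 2)
  encoder.setBuffer(o, offset: 0, index: 3)

  // One threadgroup per block of query rows, as in the FlashAttention tiling.
  encoder.dispatchThreadgroups(
    MTLSize(width: rowBlocks, height: 1, depth: 1),
    threadsPerThreadgroup: MTLSize(width: 128, height: 1, depth: 1))
  encoder.endEncoding()

  commandBuffer.commit()
  commandBuffer.waitUntilCompleted()
}
```

A C++ translation would do the same thing through metal-cpp; the kernel source itself stays identical.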

@philipturner (Owner) commented Sep 26, 2024

This is how the repo differs from FlashAttention v2:

Dao-AILab/flash-attention#1172

"v2" of this repository has nothing to do with the versioning in DaoAILab/flash-attention. The "v1" of this repository was an implementation of DaoAILab "v2", but only forward pass. The "v2" of this repository was an implementation of DaoAILab "v2", but both forward and backward pass.

For MFA v2, I removed the pre-compiled .metallib and went with code generation, which you can translate to your desired source language in a self-contained set of source files.
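
For context, here is a minimal sketch of what "code generation instead of a pre-compiled .metallib" means on the host side: the generated MSL string is compiled at runtime into a pipeline state. The kernel body and name below are placeholders, not MFA's real generated kernels.

```swift
import Metal

// Placeholder stand-in for the MSL source that a code-generation layer emits.
let generatedSource = """
#include <metal_stdlib>
using namespace metal;

kernel void placeholder_kernel(device float *data [[buffer(0)]],
                               uint gid [[thread_position_in_grid]]) {
  data[gid] *= 2;
}
"""

// Compile the generated source into a library, then into a pipeline state,
// instead of loading a pre-built .metallib from disk.
func makePipeline(device: MTLDevice, source: String) throws -> MTLComputePipelineState {
  let library = try device.makeLibrary(source: source, options: nil)
  guard let function = library.makeFunction(name: "placeholder_kernel") else {
    fatalError("Kernel not found in generated source.")
  }
  return try device.makeComputePipelineState(function: function)
}
```

Because the kernels exist as generated strings rather than a binary archive, a translation only needs to port the string-building and this small amount of host logic.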

@jackbravo (Author)

Is there a public repo for the translation to C++?

@philipturner (Owner) commented Sep 26, 2024

This repo, under the Documentation archive folder, has a C++ translation of an older version of GEMM.

github.com/philipturner/metal-flash-attention

Somebody else’s C++ translation of the newer GEMM and only the forward part of FlashAttention. Look through the commit history or PR history and you’ll find what you’re looking for.

github.com/liuliu/ccv

As for a C++ translation of the backward gradient for training models (the whole point of doing this, because forward inference is easy AF): it hasn't been explicitly translated, but you could do it with enough time to invest.

Like any code, it will not compile right away verbatim in whatever compiler you have. It is a reference that you read through and customize for your application. Liu customized the kernels a bit, so they deviate from the source tree's original goal of eliminating the fluff (batching, multi-head attention, masks, attention with linear bias, GQA, block sparsity, and a few dozen other things I don't know about). Hence I am not holding anything but my own personal translations in the source tree.
