Any plans about flash attention v2? #25

Open · jackbravo opened this issue Sep 25, 2024 · 7 comments

@jackbravo
Any plans on upgrading this repo for v2 of flash-attention?

@Nidal890

> Any plans on upgrading this repo for v2 of flash-attention?

Are you referring to FlashAttention v3? This repo was just upgraded to v2 of the FlashAttention algorithm over the summer. I think the author said he's looking into FA v3, PyTorch's FlexAttention, or something along those lines.

@jackbravo (Author)

No, v2. Sorry, from the last released version I saw (v1.0.1), the commit history, and searching the repo, I couldn't find any reference or pointer as to whether this fork supports v2 of flash-attention. That's why I asked.

@jackbravo (Author)

And I think I'm just shooting in the dark. Sorry, I thought I could use this repo as a replacement for the Python flash-attention project/package on macOS. But seeing that this is a Swift implementation of the algorithm, I don't think that is possible.

I was following the README at https://github.com/QwenLM/Qwen2-VL, which mentions that you can use flash_attention_2 to speed up inference, but the Python project seems to run only on CUDA.

@philipturner (Owner)

FlashAttention v3 was an algorithm specialized for the H100 chip. It doesn't support the backward pass or other hardware. You could argue that the metal-flash-attention repo is an alternative "3rd version", specialized for Apple hardware instead of Nvidia hardware. It improves on FlashAttention v2 by fixing some parallelization/complexity bottlenecks.

> But seeing that this is a Swift implementation of the algorithm, I don't think that is possible.

You can just translate the code to your desired language. That's been done before; I've had someone translate both the GEMM and forward FlashAttention code to C++.
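
To give a rough idea of what any host-language translation has to reproduce, here is a minimal Swift sketch of the host-side dispatch for a forward attention kernel: bind the Q/K/V/O buffers and launch one threadgroup per block of query rows. The function name, buffer order, and threadgroup sizes are placeholders for illustration, not the repository's actual symbols.

```swift
import Metal

// Hypothetical host-side dispatch of a compiled forward-attention kernel.
// The pipeline is assumed to have been built from the generated MSL source.
func dispatchAttentionForward(
  device: MTLDevice,
  pipeline: MTLComputePipelineState,
  q: MTLBuffer, k: MTLBuffer, v: MTLBuffer, o: MTLBuffer,
  rowBlocks: Int
) {
  let queue = device.makeCommandQueue()!
  let commandBuffer = queue.makeCommandBuffer()!
  let encoder = commandBuffer.makeComputeCommandEncoder()!

  encoder.setComputePipelineState(pipeline)
  encoder.setBuffer(q, offset: 0, index: 0)
  encoder.setBuffer(k, offset: 0, index: 1)
  encoder.setBuffer(v, offset: 0, index: 2)
  encoder.setBuffer(o, offset: 0, index: 3)

  // One threadgroup per block of query rows, as in the FlashAttention tiling.
  encoder.dispatchThreadgroups(
    MTLSize(width: rowBlocks, height: 1, depth: 1),
    threadsPerThreadgroup: MTLSize(width: 128, height: 1, depth: 1))
  encoder.endEncoding()

  commandBuffer.commit()
  commandBuffer.waitUntilCompleted()
}
```

A C++ translation would do the same thing through metal-cpp; the kernel source itself stays identical.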

@philipturner (Owner) commented Sep 26, 2024

This is how the repo differs from FlashAttention v2:

Dao-AILab/flash-attention#1172

"v2" of this repository has nothing to do with the versioning in DaoAILab/flash-attention. The "v1" of this repository was an implementation of DaoAILab "v2", but only forward pass. The "v2" of this repository was an implementation of DaoAILab "v2", but both forward and backward pass.

For MFA v2, I removed the pre-compiled .metallib and went with code generation, which you can translate to your desired source language in a self-contained set of source files.
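
For context, here is a minimal sketch of what "code generation instead of a pre-compiled .metallib" means on the host side: the generated MSL string is compiled at runtime into a pipeline state. The kernel body and name below are placeholders, not MFA's real generated kernels.

```swift
import Metal

// Placeholder stand-in for the MSL source that a code-generation layer emits.
let generatedSource = """
#include <metal_stdlib>
using namespace metal;

kernel void placeholder_kernel(device float *data [[buffer(0)]],
                               uint gid [[thread_position_in_grid]]) {
  data[gid] *= 2;
}
"""

// Compile the generated source into a library, then into a pipeline state,
// instead of loading a pre-built .metallib from disk.
func makePipeline(device: MTLDevice, source: String) throws -> MTLComputePipelineState {
  let library = try device.makeLibrary(source: source, options: nil)
  guard let function = library.makeFunction(name: "placeholder_kernel") else {
    fatalError("Kernel not found in generated source.")
  }
  return try device.makeComputePipelineState(function: function)
}
```

Because the kernels exist as generated strings rather than a binary archive, a translation only needs to port the string-building and this small amount of host logic.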

@jackbravo (Author)

Is there a public repo for the translation to C++?

@philipturner (Owner) commented Sep 26, 2024

This repo, under the Documentation archive folder, has a C++ translation of an older version of GEMM.

github.com/philipturner/metal-flash-attention

Somebody else’s C++ translation of the newer GEMM and only the forward part of FlashAttention. Look through the commit history or PR history and you’ll find what you’re looking for.

github.com/liuliu/ccv

As for a C++ translation of the backward gradient for training models (the whole point of doing this, because forward inference is easy AF): it hasn't been explicitly translated, but you could do it with enough time to invest.

Like any code, it will not compile right away verbatim in whatever compiler you have. It is a reference that you read through and customize for your application. Liu customized the kernels a bit, so they deviate from the source tree's original goal of eliminating the fluff (batching, multi-head attention, masks, attention with linear bias, GQA, block sparsity, and a few dozen other things I don't know about). Hence I am not holding anything but my own personal translations in the source tree.
