[FlexAttention] Add initial benchmarks #3578

Open
wants to merge 14 commits into main
Conversation

mfrancepillois
Contributor

Add benchmarks to evaluate FlexAttention kernel performance.
Add these benchmarks to the CI workflow (this requires installing a specific PyTorch version with XPU FlexAttention support enabled).

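For context, these benchmarks exercise PyTorch's FlexAttention API. A minimal usage sketch (the shapes are illustrative, and the "xpu" device assumes the XPU-enabled PyTorch build discussed below):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# Causal masking expressed as a mask_mod, the core FlexAttention abstraction.
def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

B, H, S, D = 2, 8, 1024, 64  # illustrative shapes, not the benchmark's grid
device = "xpu"  # assumption: requires a PyTorch build with XPU FlexAttention support

q, k, v = (torch.randn(B, H, S, D, device=device, dtype=torch.float16)
           for _ in range(3))
block_mask = create_block_mask(causal, B, H, S, S, device=device)
out = flex_attention(q, k, v, block_mask=block_mask)
```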
@mfrancepillois mfrancepillois marked this pull request as ready for review March 3, 2025 12:59
@mfrancepillois mfrancepillois linked an issue Mar 3, 2025 that may be closed by this pull request
@mfrancepillois
Contributor Author

@liangan1, this PR adds two FlexAttention-based benchmarks to start monitoring FlexAttention performance. Could you please have a look?

@@ -37,8 +37,12 @@ runs:
   ITEM_PATH="${{ inputs.root }}/${{ inputs.key }}"
   echo "dest=$ITEM_PATH" >> $GITHUB_OUTPUT
   if [[ -d ${{ inputs.path }} ]]; then
-    echo "Directory ${{ inputs.path }} exists and will not be restored from cache"
-    exit 1
+    if [[ ${{ inputs.repository == 'liangan1/pytorch' }} ]]; then
Contributor

I don't think we need it here. Just add `rm -rf pytorch` in the workflow.

Contributor Author

The workflow has been modified to delete the directory if it already exists (regardless of the repository).
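For reference, a minimal sketch of what the updated step might look like (the exact message and cleanup logic in the merged workflow may differ):

```bash
if [[ -d "${{ inputs.path }}" ]]; then
  echo "Directory ${{ inputs.path }} already exists and will be removed"
  rm -rf "${{ inputs.path }}"
fi
```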

@@ -37,8 +37,8 @@ runs:
   ITEM_PATH="${{ inputs.root }}/${{ inputs.key }}"
   echo "dest=$ITEM_PATH" >> $GITHUB_OUTPUT
   if [[ -d ${{ inputs.path }} ]]; then
-    echo "Directory ${{ inputs.path }} exists and will not be restored from cache"
-    exit 1
+    echo "Directory ${{ inputs.path }} already exists and will not be removed"
Contributor

Not sure I understand: it says "will not be removed", so why is it removed on the next line?

Contributor Author

I'm not sure I understand what you mean. The original code was modified in response to this comment from Pavel.

Contributor

@pbchekin Is the current change what you expected? Why echo that the directory will not be removed, and then remove it right after?

Contributor Author

Sorry, my mistake. I didn't re-read my comment carefully. The comment has been updated.

@@ -37,8 +37,8 @@ runs:
   ITEM_PATH="${{ inputs.root }}/${{ inputs.key }}"
   echo "dest=$ITEM_PATH" >> $GITHUB_OUTPUT
   if [[ -d ${{ inputs.path }} ]]; then
-    echo "Directory ${{ inputs.path }} exists and will not be restored from cache"
-    exit 1
+    echo "Directory ${{ inputs.path }} already exists and will not be removed"
Contributor

I don't understand the comment. The directory here exists and the next line will remove it.

@@ -45,8 +45,14 @@
   if: inputs.ref != ''
   shell: bash
   run: |
     echo "PYTORCH_REPO=${{ inputs.repository }}" | tee -a "$GITHUB_ENV"
     echo "PYTORCH_COMMIT_ID=${{ steps.commit-id.outputs.commit_id }}" | tee -a "$GITHUB_ENV"
+    if [[ "${{ inputs.repository }}" = "liangan1/pytorch" ]]; then
Contributor

What is "liangan1" ? Why do we need to use a personal directory ?

Contributor Author

Yes, we need to fetch and install a specific PyTorch version with XPU support for FlexAttention. Currently this code is only available in Liangang's fork (under the liangan1 account).
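For illustration, fetching and building the fork might look like the following sketch (the clone target and build commands are assumptions, not taken from the workflow):

```bash
# Hypothetical steps; the actual workflow pins a specific commit via PYTORCH_COMMIT_ID.
git clone https://github.com/liangan1/pytorch.git pytorch
cd pytorch
pip install -r requirements.txt
python setup.py develop  # builds PyTorch from source with XPU support
```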

x_names=['Z', 'H', 'N_CTX', 'D_HEAD', 'CAUSAL', 'MODE'],
x_vals=[[z, h, 16384 // z, dhead, causal, mode]
for z in [1, 2, 4, 8, 16, 32]
for (h, dhead) in [(16, 128), (32, 64)]

Suggest aligning with the requirements in https://jira.devtools.intel.com/browse/TRITONXPU-172, e.g., GQA/MHA and paged KV cache, plus broader head-dim and sequence-length coverage.
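As a hedged sketch of what GQA coverage could look like in this benchmark's parameter grid (the H_q/H_kv names and shape list are assumptions, not the PR's code):

```python
import os

# Separate query-head and kv-head counts let one grid cover both MHA and GQA.
x_names = ['Z', 'H_q', 'H_kv', 'N_CTX', 'D_HEAD', 'CAUSAL', 'MODE']
x_vals = [
    [z, h_q, h_kv, 16384 // z, dhead, causal, mode]
    for z in [1, 2, 4, 8, 16, 32]
    # MHA uses h_q == h_kv; GQA shares each kv head across a group of query heads.
    for (h_q, h_kv, dhead) in [(16, 16, 128), (32, 8, 64)]
    for causal in [False, True]
    for mode in [os.getenv('FA_KERNEL_MODE', 'fwd')]
]
```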

Comment on lines 40 to 51
+ [[z, h, 1024, dhead, True, mode]
for z in [1, 2, 4, 8, 16, 32, 64]
for (h, dhead) in [(8, 128), (32, 96), (4, 128)]
for mode in [os.getenv('FA_KERNEL_MODE', 'fwd')]] #
+ [[z, h, 1024 + 64, dhead, True, mode]
for z in [1, 2, 4, 8, 16, 32]
for (h, dhead) in [(8, 128), (32, 96), (4, 128)]
for mode in [os.getenv('FA_KERNEL_MODE', 'fwd')]] #
+ [[z, h, 1024 + 128 + 512, dhead, True, mode]
for z in [1, 2, 4, 8, 16, 32]
for (h, dhead) in [(8, 128), (32, 96), (4, 128)]
for mode in [os.getenv('FA_KERNEL_MODE', 'fwd')]], #
Contributor Author

Some of the largest classical LLM shapes specified in https://jira.devtools.intel.com/browse/TRITONXPU-172 cannot be evaluated due to resource limitations on PVC.
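For a sense of scale, a rough arithmetic sketch of why the largest shapes are out of reach (the shape below is illustrative, not taken from the ticket):

```python
# Memory for the q, k, v inputs alone, before intermediate buffers
# or backward-pass state, which multiply this figure further.
Z, H, N_CTX, D_HEAD = 32, 32, 16384, 128  # illustrative large LLM shape
bytes_per_elem = 2  # fp16
per_tensor = Z * H * N_CTX * D_HEAD * bytes_per_elem
print(f"q+k+v: {3 * per_tensor / 2**30:.0f} GiB")  # ~12 GiB
```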


There are two kernels for FlexAttention: flex-attention for the prefill stage and flex-decoding for the decoding stage. Only a sequence length of 16384/z (for query, key, and value) is covered in this benchmark, and only for the prefill stage. In real use there are three stages: prefill (len(q) = len(k) = len(v)), decoding (len(q) = 1, much smaller than len(k) = len(v)), and extend (e.g., multi-round chat: len(q) > 1 and len(q) < len(k) = len(v)).
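A small sketch of the tensor shapes involved in each stage (sizes are illustrative):

```python
import torch

B, H, S, D = 2, 8, 1024, 64  # batch, heads, kv sequence length, head dim

# Prefill: len(q) == len(k) == len(v)
q_prefill = torch.randn(B, H, S, D)
k = torch.randn(B, H, S, D)
v = torch.randn(B, H, S, D)

# Decoding: len(q) == 1, much smaller than len(k) == len(v)
q_decode = torch.randn(B, H, 1, D)

# Extend (e.g., multi-round chat): 1 < len(q) < len(k) == len(v)
q_extend = torch.randn(B, H, 128, D)
```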

Contributor Author

Thanks for these additional explanations. Since enhancing the benchmarks to evaluate the other stages (Decode and Append), GQA, and paged KV cache requires significant work, I would prefer this PR to focus on adding an initial benchmark for the FlexAttention prefill stage only (similar to our current FA benchmark) and to address the remaining limitations in separate PRs. I have created #3615, #3616, and #3617 for this purpose.


Makes sense. This PR is a good starting point.

@mfrancepillois mfrancepillois changed the title from "[FlexAttention] Add benchmarks" to "[FlexAttention] Add initial benchmarks" on Mar 5, 2025
Successfully merging this pull request may close these issues.

Add FlexAttention to benchmarks/triton_kernels_benchmark