Add Implementation of Native Sparse Attention by yukavio · Pull Request #137 · HazyResearch/ThunderKittens

yukavio · 2025-07-22T11:20:37Z

This PR adds an implementation of the Compressed Attention and Selected Attention kernels of Native Sparse Attention.

The hyperparameters of the selected and compressed attention kernels are tuned for good performance on H20. They should be changed to get better performance on other devices.
This PR is not ready for merging. I will reorganize the code and add detailed performance metrics to this PR this week.

The full implementation, which can be used to train Native Sparse Attention models, can be found at https://github.com/yukavio/nsa/tree/main/. That codebase is currently implemented with Triton, but we will soon switch to the kernels introduced in this PR for better performance. This is my first time contributing code to the ThunderKittens community, and I welcome any suggestions for improvement.
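For readers unfamiliar with the operation, the following is a minimal pure-Python sketch of what a selected-attention forward pass computes for a single head. It is a reference for the math only, not the kernel in this PR: the function name, the argument names, and the per-query `block_indices` format are hypothetical, and the real kernels operate on batched BTHD tensors on the GPU.

```python
import math

def selected_attention(q, k, v, block_indices, block_size, causal=True):
    """Reference selected attention for one head (illustrative only).

    q, k, v: lists of T vectors, each a list of floats.
    block_indices[t]: ids of the KV blocks that query t attends to
                      (hypothetical input format, assumed for this sketch).
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    head_dim_v = len(v[0])
    out = []
    for t, q_t in enumerate(q):
        # Collect key positions covered by the selected blocks,
        # dropping future positions when causal masking is on.
        positions = []
        for blk in sorted(set(block_indices[t])):
            start = blk * block_size
            stop = min(start + block_size, len(k))
            for pos in range(start, stop):
                if not causal or pos <= t:
                    positions.append(pos)
        if not positions:
            # Nothing visible to this query: emit zeros.
            out.append([0.0] * head_dim_v)
            continue
        # Scaled dot-product scores over the selected positions only.
        scores = [scale * sum(x * y for x, y in zip(q_t, k[p])) for p in positions]
        # Numerically stable softmax.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[p][d] for w, p in zip(weights, positions)) / total
            for d in range(head_dim_v)
        ])
    return out
```

With all keys set to zero the softmax weights are uniform, so each output row is the mean of the visible value rows, which gives an easy way to sanity-check the masking and block selection by hand.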


yukavio and others added 26 commits June 13, 2025 03:23

change layout to BTHD

0940e2f

refine compress attn fwd BTHD

5edd9e6

support compression attn layout

e634e19

support short kv, but bwd have perf problem

6eebef4

compress fwd tested

825a319

fix compress attn bwd performance

d010761

kv grad have some precision gap in tail

55bedf1

complete compress attn

b34d852

add nsa compress attn op for tk

170c7f6

add selection attn skeleton

ec14b1b

add mha bthd backup

c5ff6f8

refine nsa skeleton

f1335fd

add selection attn no causal without indices

2708063

selection attn fwd with Head Dim == 64 tested

517afad

complete selection attn fwd

5fd3d31

add bwd for selection attn without causal and indices, perf of width=128 need to improve

5491796

improve performance of selection attention bwd with width=128

ff88a10

support selection attn causal=True

55ce9d7

add indices input for selection attn, performance of fwd kernel should be improve

6b3af43

remove consumer warpgroups

ffb69ea

optimize fwd performance of selection attn

5972fbc

complete the implementation of selection attn fwd, mayby still have some optimization opportunity

fe3969a

need remove multi consumer warps of selection attn bwd

bb893eb

complete the implementation of selection attention

9c20ae2

clean the code for PR

2fd06d6

add unit test for nsa attention

0d5f05d

StuartSul force-pushed the main branch from cdb5fea to 50f75fd Compare September 15, 2025 22:06

yukavio wants to merge 26 commits into HazyResearch:main from yukavio:nsa
