GPU Pallas decode_attention improvements #24602

Open · rdyro wants to merge 10 commits into `main` from `rdyro-decode-attention-mask`

Conversation

@rdyro (Collaborator) commented Oct 29, 2024

Changes:

  • adding `mask` and `bias` arguments to decode attention (a plain-JAX sketch of the intended semantics follows this list)
  • fixing a bug where `q_heads // kv_heads > block_h` caused out-of-bounds indexing (guard sketched below)
  • adding variable work dispatch when a mask is present, by computing the `fori_loop` max iteration (loop-bound sketch below)
  • changing the default `sm_scale` to `1 / math.sqrt(q.shape[-1])` to match the `jax.nn.dot_product_attention` default
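
For reference, here is the computation the new `mask` and `bias` arguments and the new `sm_scale` default are meant to express, written as a plain-JAX sketch rather than the Pallas kernel itself (the function name and MQA-style shapes are assumptions for illustration):

```python
import math

import jax
import jax.numpy as jnp

def decode_attention_reference(q, k, v, mask=None, bias=None, sm_scale=None):
    """Single-token (decode) attention, MQA-style layout.

    q: (num_heads, head_dim); k, v: (kv_seq_len, head_dim)
    mask: optional (num_heads, kv_seq_len) boolean, True = attend
    bias: optional (num_heads, kv_seq_len) additive logit bias
    """
    if sm_scale is None:
        # New default, matching jax.nn.dot_product_attention.
        sm_scale = 1.0 / math.sqrt(q.shape[-1])
    logits = jnp.einsum("hd,sd->hs", q, k) * sm_scale
    if bias is not None:
        logits = logits + bias
    if mask is not None:
        # Masked-out positions get -inf so they vanish under softmax.
        logits = jnp.where(mask, logits, -jnp.inf)
    return jax.nn.softmax(logits, axis=-1) @ v
```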
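The out-of-bounds fix can be read as a guard on the head tile size: the per-kv-head block of query heads must not exceed the number of query heads that actually share one kv head. A hypothetical sketch of that clamp (not the PR's actual code):

```python
def clamp_block_h(block_h: int, num_q_heads: int, num_kv_heads: int) -> int:
    # If block_h > num_q_heads // num_kv_heads, the kernel would index
    # past the query-head axis; clamp the tile to the valid range.
    return min(block_h, num_q_heads // num_kv_heads)
```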
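With a mask available, the kernel can bound its `fori_loop` at the last attended k/v block instead of always scanning the full sequence. A minimal sketch of that bound computation (helper name and single-row mask layout are assumptions):

```python
import jax.numpy as jnp

def num_kv_blocks_to_visit(mask, block_k):
    # mask: (kv_seq_len,) boolean, True where attention is allowed.
    # Last attended position, or -1 if everything is masked out.
    last = jnp.max(jnp.where(mask, jnp.arange(mask.shape[-1]), -1))
    # Round up to whole k/v blocks; fully masked tail blocks are skipped.
    return (last + block_k) // block_k
```

The resulting count would then serve as the upper bound in `jax.lax.fori_loop(0, n_blocks, body, init)`, so a mostly-masked sequence does proportionally less work.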

@rdyro changed the title from "Decode attention improvements" to "GPU Pallas decode_attention improvements" on Oct 29, 2024
@rdyro force-pushed the rdyro-decode-attention-mask branch from ec63f1c to 50f71ab on Oct 30, 2024 01:41
@rdyro changed the title from "GPU Pallas decode_attention improvements" to "[WIP] GPU Pallas decode_attention improvements" on Oct 31, 2024
@rdyro force-pushed the rdyro-decode-attention-mask branch from 50f71ab to 8b0912b on Oct 31, 2024 19:31
@rdyro force-pushed the rdyro-decode-attention-mask branch from 8b0912b to 4e6527a on Oct 31, 2024 19:32
@rdyro changed the title from "[WIP] GPU Pallas decode_attention improvements" to "GPU Pallas decode_attention improvements" on Oct 31, 2024