
[Splash Attention] Remove unnecessary head_dim_v % NUM_LANES == 0 constraint #27427


Open
ds-hwang opened this issue Mar 25, 2025 · 1 comment
Labels: enhancement (New feature or request)

@ds-hwang (Contributor) commented Mar 25, 2025

Hi JAX team and @sharadmv,
I found an unnecessary constraint in the flash_attention_kernel implementation in the Splash Attention module that prevents small models from using it.

Currently, the kernel enforces that head_dim_v must be a multiple of NUM_LANES (128):

head_dim_v_repeats, rem = divmod(head_dim_v, NUM_LANES)
if rem != 0:
    raise NotImplementedError(
        f"{head_dim_v=} should be a multiple of {NUM_LANES}"
    )


However, this constraint is unnecessary. It arises from the way head_dim_v_repeats is used to widen alpha:

alpha_o = pltpu.repeat(alpha, head_dim_v_repeats, axis=1)

This repeat assumes alpha has shape [bq, NUM_LANES] and widens it to [bq, head_dim_v]. In fact, alpha is $(d_{i-1}' \cdot e^{m_{i-1} - m_i}) / d_i'$ in Algorithm FlashAttention (Tiling), so it should have shape [bq, 1], not [bq, NUM_LANES].
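
For reference, here is a minimal, self-contained sketch of the FlashAttention (Tiling) recurrence in plain JAX, outside the Pallas kernel (the function and variable names below are illustrative, not taken from the kernel). It shows that the rescale factor alpha only needs one value per query row, i.e. shape [bq, 1]:

import jax.numpy as jnp

def tiled_attention(q, k_blocks, v_blocks):
  # Computes softmax(q @ k.T) @ v one key/value block at a time.
  bq = q.shape[0]
  head_dim_v = v_blocks[0].shape[-1]
  m = jnp.full((bq, 1), -jnp.inf)   # running max, shape [bq, 1]
  d = jnp.zeros((bq, 1))            # running softmax denominator, shape [bq, 1]
  o = jnp.zeros((bq, head_dim_v))   # running output, shape [bq, head_dim_v]
  for k, v in zip(k_blocks, v_blocks):
    s = q @ k.T                     # scores for this block, [bq, bkv]
    m_new = jnp.maximum(m, s.max(axis=-1, keepdims=True))
    p = jnp.exp(s - m_new)          # [bq, bkv]
    d_new = d * jnp.exp(m - m_new) + p.sum(axis=-1, keepdims=True)
    # alpha = (d_{i-1}' * e^{m_{i-1} - m_i}) / d_i' -- one scalar per query row.
    alpha = d * jnp.exp(m - m_new) / d_new
    o = alpha * o + (p @ v) / d_new # the [bq, 1] alpha broadcasts over head_dim_v
    m, d = m_new, d_new
  return o

With head_dim_v = 80, for example, this works through ordinary broadcasting, while the kernel's current check raises NotImplementedError for any head_dim_v that is not a multiple of 128.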

In the code:

  • $m_i$ corresponds to m_scratch
  • $d_i'$ corresponds to l_scratch

Currently, both are unnecessarily allocated with shape [bq, NUM_LANES], causing all NUM_LANES elements to redundantly store the same value.
Source:

jax.ShapeDtypeStruct((bq, NUM_LANES), jnp.float32), # m_scratch
jax.ShapeDtypeStruct((bq, NUM_LANES), jnp.float32), # l_scratch
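
To illustrate the redundancy in plain jnp (outside the kernel, with illustrative sizes): when every lane of a [bq, NUM_LANES] alpha holds the same value, widening it to [bq, head_dim_v] with pltpu.repeat produces exactly what broadcasting a [bq, 1] alpha would produce. Since all lanes are equal, the exact ordering used by the repeat does not matter, so jnp.concatenate stands in for it here.

import jax.numpy as jnp

bq, NUM_LANES, head_dim_v = 8, 128, 256                    # illustrative sizes
alpha_1 = jnp.linspace(0.1, 0.9, bq).reshape(bq, 1)        # proposed layout: [bq, 1]
alpha_lanes = jnp.broadcast_to(alpha_1, (bq, NUM_LANES))   # current layout: NUM_LANES equal copies per row
o = jnp.arange(bq * head_dim_v, dtype=jnp.float32).reshape(bq, head_dim_v)

# Stand-in for pltpu.repeat(alpha, head_dim_v // NUM_LANES, axis=1).
alpha_o = jnp.concatenate([alpha_lanes] * (head_dim_v // NUM_LANES), axis=1)

assert jnp.allclose(alpha_o * o, alpha_1 * o)              # broadcasting gives the same result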

To be clear, let me annotate the shapes in the FlashAttention (Tiling) algorithm from https://courses.cs.washington.edu/courses/cse599m/23sp/notes/flashattn.pdf:

[Image: FlashAttention (Tiling) algorithm from the linked notes, annotated with the tensor shapes]


Why this matters

By changing the shape of m_scratch, l_scratch, and alpha to [bq, 1]:

  • We avoid unnecessary memory and compute overhead.
  • We eliminate the need to repeat(alpha, head_dim_v_repeats, ...).
  • Most importantly, we can remove the head_dim_v % NUM_LANES == 0 constraint entirely.

This would enable smaller models (e.g. small LLMs whose head_dim_v is below 128) to use Splash Attention.


Unfortunately, due to my company's internal policy, I’m unable to submit a PR myself. I hope the JAX maintainers can address this issue — fixing it should not require major changes and will benefit a wider range of users.

Thanks for your excellent work!

@ds-hwang (Contributor, Author) commented Apr 2, 2025

@jakevdp, @mattjj, @sharadmv Could you take a look when you have a moment? Thank you.

@Rifur13 self-assigned this Apr 8, 2025