
Compared to CausalFullAttention, Taylor is slower to train and uses more GPU memory #3

Open
junphine opened this issue Feb 29, 2024 · 2 comments

Comments

@junphine
Copy link

junphine commented Feb 29, 2024

| Module | Params | Heads | Head dim | Seq len | Throughput | GPU memory |
| --- | --- | --- | --- | --- | --- | --- |
| Taylor | 300M | 48 | 16 | 2048 | 0.62 it/s | 32 GB |
| CausalFullAttention | 310M | 8 | 96 | 2048 | 1.77 it/s | 26 GB |
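
(For context: the memory gap follows from the quadratic feature expansion. Below is a minimal pure-PyTorch sketch of causal second-order Taylor linear attention; it illustrates the standard Taylor feature map, not this repo's exact implementation, and the naive cumulative sum materializes the full running state, which an optimized kernel would process in chunks.)

```python
import torch

def taylor_feature_map(x):
    # 2nd-order Taylor expansion of exp(q . k): exp(s) ≈ 1 + s + s²/2,
    # which factorizes into phi(x) = [1, x, (x ⊗ x) / √2] of size 1 + d + d²
    b, h, n, d = x.shape
    x2 = torch.einsum('... i, ... j -> ... i j', x, x).flatten(-2) / (2 ** 0.5)
    ones = torch.ones(b, h, n, 1, device=x.device, dtype=x.dtype)
    return torch.cat((ones, x, x2), dim=-1)  # (b, h, n, 1 + d + d²)

def causal_taylor_attention(q, k, v, eps=1e-6):
    # q, k, v: (batch, heads, seq, head_dim)
    q, k = taylor_feature_map(q), taylor_feature_map(k)
    # running state sum_{j<=i} phi(k_j) v_j^T, materialized per position here.
    # with head_dim 16 the feature size is 1 + 16 + 256 = 273, so at 48 heads
    # and seq_len 2048 this state alone is 48 * 2048 * 273 * 16 floats,
    # about 1.7 GB in fp32 per layer per sample; hence the extra GPU memory
    kv = torch.einsum('b h n e, b h n d -> b h n e d', k, v).cumsum(dim=2)
    num = torch.einsum('b h n e, b h n e d -> b h n d', q, kv)
    # denominator is positive since 1 + s + s²/2 >= 0.5; clamp is only for safety
    den = torch.einsum('b h n e, b h n e -> b h n', q, k.cumsum(dim=2)).clamp(min=eps)
    return num / den.unsqueeze(-1)

# same heads / head_dim as the Taylor row above, shorter length so the
# naively materialized state fits comfortably in memory
q = k = v = torch.randn(1, 48, 256, 16)
out = causal_taylor_attention(q, k, v)  # (1, 48, 256, 16)
```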

junphine (Author) commented

[screenshot: 企业微信截图_17092051833001 (WeCom screen capture)]
Also, the Taylor variant's loss decreases more slowly than full attention's.

lucidrains (Owner) commented

@junphine the benefits really only come at a certain sequence length, 4096 and beyond

even then, a head dimension of 16 is just too much of a handicap
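
A rough per-layer FLOP model makes the trade-off concrete (my assumptions: softmax attention costed at ~4·n²·d per head, a second-order Taylor feature map of size e = 1 + d + d², and the two configurations reported above). On paper the linear variant is already cheaper at 2048, so the measured slowdown comes from the d²-sized running state and memory traffic rather than raw FLOPs, and the wall-clock win only shows up at longer sequences:

```python
# rough per-layer attention cost model; constants are approximate and it
# ignores kernel efficiency and memory bandwidth, which dominate in practice
def full_attn_flops(n, heads, d):
    return heads * 4 * n * n * d      # QK^T scores plus attention-weighted values

def taylor_attn_flops(n, heads, d):
    e = 1 + d + d * d                 # expanded feature dimension (273 for d=16)
    return heads * 4 * n * e * d      # build phi(k) v^T state, read it with phi(q)

for n in (2048, 4096, 8192, 16384):
    full = full_attn_flops(n, heads=8, d=96)        # CausalFullAttention config
    taylor = taylor_attn_flops(n, heads=48, d=16)   # Taylor config from the issue
    print(f'n={n:6d}  full={full:.2e}  taylor={taylor:.2e}  ratio={full / taylor:.1f}x')
```

Note that e grows quadratically in the head dimension, so a larger head_dim would inflate the linear-attention cost; keeping head_dim at 16 is what makes the approximation cheap, and, per the comment above, also what handicaps model quality.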
