Taylor (pt: 300M, heads: 48, head_dim: 16, seq_len: 2048) — 0.62 it/s, GPU: 32 GB
CausalFullAttention (pt: 310M, heads: 8, head_dim: 96, seq_len: 2048) — 1.77 it/s, GPU: 26 GB
Also, the loss decreases more slowly with Taylor attention than with full attention.
@junphine the benefits really only come at a certain sequence length, 4096 and beyond. Even then, a head dimension of 16 is just too much of a handicap.
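The sequence-length point can be illustrated with a rough back-of-envelope FLOP model (a sketch, not from the thread; the function names and the second-order feature-map size are assumptions). Full attention scales quadratically in sequence length, while a Taylor-expansion linear attention scales linearly but pays a large constant from the expanded feature dimension, so its relative advantage only grows at longer sequences. Note that FLOPs alone do not capture kernel launch overhead or memory traffic, which is part of why full attention can still be faster in wall-clock terms at 2048.

```python
# Hypothetical FLOP comparison for the two configurations in the benchmark above.

def full_attention_flops(seq_len: int, dim_head: int, heads: int) -> int:
    # Q @ K^T and attn @ V: two (n x n x d) matmuls per head -> O(n^2 * d)
    return heads * 2 * seq_len * seq_len * dim_head

def taylor_attention_flops(seq_len: int, dim_head: int, heads: int) -> int:
    # A 2nd-order Taylor feature map expands each query/key to roughly
    # 1 + d + d^2 features (assumption); cost is linear in seq_len.
    expanded = 1 + dim_head + dim_head ** 2
    return heads * 2 * seq_len * expanded * dim_head

for n in (2048, 4096, 8192, 16384):
    full = full_attention_flops(n, dim_head=96, heads=8)
    taylor = taylor_attention_flops(n, dim_head=16, heads=48)
    print(f"seq_len={n}: full/taylor FLOP ratio = {full / taylor:.1f}")
```

The ratio grows linearly with sequence length, which matches the observation that the crossover favoring the linear variant only arrives at 4096 and beyond in practice.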