MLA (multi-head latent attention) is supposed to speed up inference, but because this implementation stores more data in the cache than the baseline (Llama), it brings no benefit at all: compared with the baseline (Llama), it uses more GPU memory and runs inference more slowly.
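For scale, assuming the published DeepSeek-V3 config (128 heads, `qk_nope_head_dim=128`, `qk_rope_head_dim=64`, `v_head_dim=128`, `kv_lora_rank=512`): caching the decompressed per-head keys and values costs 128×(128+64) + 128×128 = 40960 elements per token per layer, versus 2×8×128 = 2048 for a Llama-70B-style GQA baseline, and only 512+64 = 576 for the compressed latent that MLA is designed to cache. That is roughly 20× the baseline footprint and roughly 70× the intended one.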
Below is the MLA implementation from the official DeepSeekV3 repository on Hugging Face; as it shows, the amount of data written to the KV cache is even larger than the baseline (Llama):
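(The snippet itself is not preserved in this text. Below is a self-contained toy sketch of the caching pattern in `DeepseekV3Attention.forward`, paraphrased from `modeling_deepseek.py` rather than quoted verbatim; dims are taken from the public DeepSeek-V3 config, and the RMSNorm on the latent is omitted for brevity.)

```python
import torch

# Toy reproduction of the caching pattern in DeepseekV3Attention.forward
# (paraphrase, not the verbatim HF source).
bsz, q_len, hidden_size = 1, 1, 7168
num_heads = 128
qk_nope_head_dim, qk_rope_head_dim, v_head_dim = 128, 64, 128
kv_lora_rank = 512

hidden_states = torch.randn(bsz, q_len, hidden_size)
kv_a_proj_with_mqa = torch.nn.Linear(hidden_size, kv_lora_rank + qk_rope_head_dim, bias=False)
kv_b_proj = torch.nn.Linear(kv_lora_rank, num_heads * (qk_nope_head_dim + v_head_dim), bias=False)

# Step 1: compress into the small latent -- this is all MLA needs to cache.
compressed_kv = kv_a_proj_with_mqa(hidden_states)
compressed_kv, k_pe = torch.split(compressed_kv, [kv_lora_rank, qk_rope_head_dim], dim=-1)
print("latent per token:", compressed_kv.numel() + k_pe.numel())  # 576

# Step 2: ...but the forward pass immediately up-projects the latent back into
# full per-head keys/values before touching the cache.
kv = (
    kv_b_proj(compressed_kv)
    .view(bsz, q_len, num_heads, qk_nope_head_dim + v_head_dim)
    .transpose(1, 2)
)
k_nope, value_states = torch.split(kv, [qk_nope_head_dim, v_head_dim], dim=-1)
k_pe = k_pe.view(bsz, q_len, 1, qk_rope_head_dim).transpose(1, 2)
key_states = torch.cat([k_nope, k_pe.expand(-1, num_heads, -1, -1)], dim=-1)

# Step 3: these decompressed tensors are what past_key_value.update() receives,
# so the cache grows by num_heads * (192 + 128) = 40960 elements per token,
# not the 576-element latent.
print("cached per token:", key_states.numel() + value_states.numel())  # 40960
```

Caching only the 576-element latent and folding `kv_b_proj` into the attention computation (the "absorbed" formulation described in the DeepSeek-V2 paper) is what would actually realize MLA's memory savings.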
Below are the inference speed benchmark results:
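(The benchmark numbers are not preserved in this text. For reference, a minimal sketch of how such a decode-speed and memory comparison might be run; the model ids and prompt are placeholders to substitute, and `trust_remote_code=True` is needed for the DeepSeek remote modeling code.)

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def bench(model_id: str, prompt: str = "Hello", new_tokens: int = 256) -> None:
    """Time greedy decoding and report peak GPU memory for one model."""
    tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="cuda", trust_remote_code=True
    )
    inputs = tok(prompt, return_tensors="pt").to("cuda")
    torch.cuda.reset_peak_memory_stats()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    model.generate(**inputs, max_new_tokens=new_tokens, do_sample=False)
    torch.cuda.synchronize()
    dt = time.perf_counter() - t0
    print(f"{model_id}: {new_tokens / dt:.1f} tok/s, "
          f"peak mem {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")

# Placeholder ids -- substitute whichever MLA/baseline pair was measured.
bench("deepseek-ai/DeepSeek-V2-Lite")
bench("meta-llama/Llama-3.1-8B")
```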