Thanks to all the people at DeepSeek who really value technology for this great project. I'm currently reproducing MTP myself to draw some know-how conclusions, and I have a suggestion about a possible clarification.
The bug in the paper
In Section 2.2, line 6 [1]:
parallelly predicts 𝐷 additional tokens using independent output heads
I fully understand that your main claim is the "parallel" prediction in contrast to your "sequentially predict". However, after checking Meta's MTP paper [2], Section 2 (Column 2, Page 2), line 7:
n independent output heads implemented in terms of transformer layers $f_{h_i}$, and a shared unembedding matrix $f_u$
They use a shared "unembedding head", i.e., the lm_head or output_layer module, while the parallel final transformer layers are independent. For what it's worth, in my own implementation the model's final norm block is also shared. So I suggest that the wording could be changed to:
Different from Gloeckle et al. (2024), which parallelly predicts 𝐷 additional tokens using independent MTP transformer blocks before a shared output head, we let the MTP transformer blocks sequentially predict additional tokens at each prediction depth, keeping the complete causal chain.
This also fits well with your Equation (23).
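For concreteness, here is a minimal sketch of the two structures as I understand them. This is not the official implementation: `Block` is only a stand-in for a full transformer block (attention and the RMSNorm combination from Equation 21 are omitted), and names like `ParallelMTP`, `SequentialMTP`, and `DEPTH` are purely illustrative. The point it shows is where the heads are independent versus chained, and that the unembedding/output head is shared in both cases.

```python
# Simplified sketch contrasting the two MTP structures (illustrative only).
import torch
import torch.nn as nn

D_MODEL, VOCAB, DEPTH = 64, 1000, 3


class Block(nn.Module):
    """Stand-in for one transformer block (real ones use attention)."""
    def __init__(self):
        super().__init__()
        self.ff = nn.Sequential(nn.Linear(D_MODEL, D_MODEL), nn.GELU(),
                                nn.Linear(D_MODEL, D_MODEL))
        self.norm = nn.LayerNorm(D_MODEL)

    def forward(self, h):
        return self.norm(h + self.ff(h))


class ParallelMTP(nn.Module):
    """Gloeckle et al. (2024): D independent heads f_{h_i}, one shared unembedding f_u."""
    def __init__(self):
        super().__init__()
        self.heads = nn.ModuleList(Block() for _ in range(DEPTH))  # independent per depth
        self.unembed = nn.Linear(D_MODEL, VOCAB, bias=False)       # shared f_u

    def forward(self, trunk_h):
        # every head reads the same trunk state; no dependency between depths
        return [self.unembed(head(trunk_h)) for head in self.heads]


class SequentialMTP(nn.Module):
    """DeepSeek-V3-style: depth k consumes the hidden state of depth k-1,
    keeping the complete causal chain; the output head is shared (Eq. 23)."""
    def __init__(self):
        super().__init__()
        self.blocks = nn.ModuleList(Block() for _ in range(DEPTH))
        self.proj = nn.ModuleList(nn.Linear(2 * D_MODEL, D_MODEL) for _ in range(DEPTH))
        self.unembed = nn.Linear(D_MODEL, VOCAB, bias=False)        # shared output head

    def forward(self, trunk_h, future_tok_emb):
        # future_tok_emb: list of DEPTH embedded "next" tokens, as in Eq. 21
        h, logits = trunk_h, []
        for k in range(DEPTH):
            h = self.blocks[k](self.proj[k](torch.cat([h, future_tok_emb[k]], dim=-1)))
            logits.append(self.unembed(h))
        return logits


if __name__ == "__main__":
    trunk = torch.randn(2, 8, D_MODEL)                      # (batch, seq, d_model)
    embs = [torch.randn(2, 8, D_MODEL) for _ in range(DEPTH)]
    print(len(ParallelMTP()(trunk)), len(SequentialMTP()(trunk, embs)))
```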
[1] Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., ... & Piao, Y. (2024). DeepSeek-V3 Technical Report. arXiv preprint arXiv:2412.19437.
[2] Gloeckle, F., Idrissi, B. Y., Rozière, B., Lopez-Paz, D., & Synnaeve, G. (2024). Better & faster large language models via multi-token prediction. arXiv preprint arXiv:2404.19737.
Best,