[Paper BUG] About descriptions of the original MTP, little suggestion #252

chuhac · 2025-01-10T03:01:12Z

Thanks to all the people at deepseek who really value technology for this great project, I'm now also reproducing MTP myself for some know-how conclusions and I have an advice about possible clarifications.

The bug in the paper
In section 2.2 line 6[1],

parallelly predicts 𝐷 additional tokens using independent output heads

I fully understand your main claim is the "parallel" in comparison to your "sequentially predict". However, after checking meta's MTP paper [2], in the Section 2 (Column 2, Page 2) line 7,

n independent output heads implemented in terms of transformer layers $f_{hi}$, , and a shared unembedding matrix $f_u$

They use a shared "unembedding head", i.e., lm_head module or output_layer module while the parallel final layers are independent. If you ask me for my implementation, the model final norm block is also shared. So I suggest that the writing here could be changed to:

Different from Gloeckle et al. (2024), which parallelly predicts 𝐷 additional tokens using independent MTP transformer blocks before a shared output head, we let MTP transformer blocks sequentially to predict additional tokens at each prediction depth and keep the complete causal chain.

This also fits well with your Equation.23.

[1] Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., ... & Piao, Y. (2024). DeepSeek-V3 Technical Report. arXiv preprint arXiv:2412.19437.
[2 Gloeckle, F., Idrissi, B. Y., Rozière, B., Lopez-Paz, D., & Synnaeve, G. (2024). Better & faster large language models via multi-token prediction. arXiv preprint arXiv:2404.19737.

Best,

16x3b · 2025-01-12T10:29:19Z

You seem to grasp the concept of MTP well. What is the novelty and hubbub of MTP all about? I'm not sure I understand the premise. Are you able to explain the concept in simple terms for the uninitiated?

chuhac · 2025-01-12T12:32:27Z

MTP seeks to help model predicts more than one tokens during one forward pass, which I call it "Ability to plan ahead", which may be especially useful in math or code domains.

About training: MTP makes the model to learn to generate several tokens and plan the next-k-tokens in advance, which gives the model additional capabilities on top of next-(one) token prediction. In order to achieve this effect, Deepseek-V3 or Meta MTP inserts extra learnable parameters which utilizes the hidden states of backbone decoder to predict k tokens afterwards. However, this inclusion of additional modules places demands on pipeline parallelism and requires additional optimization.

About inference: Though deepseek didn't release the weights of MTP head, the multi head prediction can predict several tokens in a single forward pass, which can be utilized for speculative decoding or accelerate the inference throughput. If anyone wants to give it a try, Meta has their MTP weights released here: https://huggingface.co/facebook/multi-token-prediction

zhaoyang-star · 2025-02-11T09:03:02Z

@chuhac I am also tring to reproducing the MTP based on a Dense model. Could you mind share the implements of training with MTP? Such as which framework is chosen. Thanks in advance. I used Megatron-LM a lot but it is too heavy to port MTP into it. I am planing to port MTP into Deepspeed.

chuhac · 2025-02-12T04:50:26Z

@zhaoyang-star For copyright reasons I don't have a plan to public the Megatron code for MTP. But if you are interested to build it on top of relatively small size dense models, you can try doing minor modifications to the implementations of transformers. From my point of view it won't take too much effort compared to Megatron since you don't need any model parallel here. Last, by mentioning DeepSpeed you just need the ZeRO instead of its pipeline parallel, right ?

zhaoyang-star · 2025-02-12T07:08:47Z

@chuhac Thanks for your quick feedback. Allright, I plan to build MTP on Megatron-LM and then do some experiments to verify the model performance gained by MTP.

16x3b · 2025-02-15T10:41:43Z

@chuhac

Thank you for your insights! I will certainly do more reading to understanding this better, but also..

@zhaoyang-star

If you are interested in collaborating with someone on this subject of integrating MTP training into megatron-LM, I am curious to learn more and we can perhaps use my hardware, as well as, whatever resources you have to do training faster.

chuhac changed the title ~~[Paper BUG] About descriptions of the original MTP, little suggestionDeepSeek-V3 Technical Report~~ [Paper BUG] About descriptions of the original MTP, little suggestion Jan 10, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Paper BUG] About descriptions of the original MTP, little suggestion #252

[Paper BUG] About descriptions of the original MTP, little suggestion #252

chuhac commented Jan 10, 2025 •

edited

Loading

16x3b commented Jan 12, 2025

chuhac commented Jan 12, 2025

zhaoyang-star commented Feb 11, 2025

chuhac commented Feb 12, 2025

zhaoyang-star commented Feb 12, 2025

16x3b commented Feb 15, 2025

[Paper BUG] About descriptions of the original MTP, little suggestion #252

[Paper BUG] About descriptions of the original MTP, little suggestion #252

Comments

chuhac commented Jan 10, 2025 • edited Loading

16x3b commented Jan 12, 2025

chuhac commented Jan 12, 2025

zhaoyang-star commented Feb 11, 2025

chuhac commented Feb 12, 2025

zhaoyang-star commented Feb 12, 2025

16x3b commented Feb 15, 2025

chuhac commented Jan 10, 2025 •

edited

Loading