Need a concise pipeline of best practices for loading a large pretrained model and resuming training under FSDP/TP? #1038
-
Could someone provide a concise pipeline of best practices for loading a large pretrained model (from a single file) and resuming training (from DCP-saved checkpoints) under FSDP/TP scenarios? Given that the model parameters are too large to fit on a single GPU, the main focus is on optimizing CPU memory and GPU VRAM usage while balancing loading efficiency, and on how to adapt these operations to hybrid parallel training strategies like FSDP/TP. Although TorchTitan is an excellent project, its high level of abstraction makes finding such minimal implementations time-consuming. I would appreciate a simple summary of a minimal pipeline.
-
@tangjiasheng Since TorchTitan already supports training very large models from scratch out of the box, is the gap mainly in loading the pretrained checkpoints? What's the format of the pretrained checkpoint? Is it a very large file generated by `torch.save`?
-
Yes. Like loading a pretrained checkpoint from ONE downloaded file.
-
What format are you using? Even Hugging Face doesn't assume one large file. We do have plans to add template converters for certain models, support certain formats like Hugging Face, and demonstrate how to use them. Overall, we expect users to first convert the file to DCP format using the converter. If the converter doesn't support the format, users will have to modify it. Then users can feed the converted DCP files into the default TorchTitan training flow to do post-training, as in the sketch below. Does this answer your question?
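A minimal sketch of that conversion step, assuming a checkpoint produced by `torch.save` and PyTorch's `torch.distributed.checkpoint` (DCP) APIs; the file names are placeholders and this is not torchtitan's actual converter:

```python
# Offline, single-process conversion: read the full torch.save checkpoint
# on CPU, then re-save it in DCP's sharded directory format. DCP's save()
# falls back to a single-rank mode when no process group is initialized.
import torch
import torch.distributed.checkpoint as dcp

# mmap=True avoids materializing the whole file in CPU RAM at once
# (requires a checkpoint saved with the default zipfile serialization).
state_dict = torch.load(
    "pretrained.pt", map_location="cpu", mmap=True, weights_only=True
)

dcp.save({"model": state_dict}, checkpoint_id="pretrained_dcp/")
```

After this one-time conversion, the DCP directory can be loaded directly by a sharded training run, with each rank reading only its own shards.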
-
Hey, thanks for the question. I'd like to understand your request better.
Since you said "complex parallelism strategies", you are acknowledging that this is complex, so asking for "a concise pipeline" sounds contradictory to that complexity. Are you looking for something customized for your particular use case (e.g. using 32 GPUs, model size is 8B, load from checkpoint, apply FSDP and TP only)? Everyone's use case is different, so the combinatorial nature makes it hard to have a workflow for every single case. Could you say more about your use case, and what you don't need in torchtitan, so that we can maybe help write a note/tutorial or even a trimmed version of the code for the workflow, if your case is representative?
-
Thanks @wwwjn, this is great. We should also add the checkpointing part as it is always a critical point after people initialize the parallelisms. |
-
Following our previous discussion, I'm attempting to load a single checkpoint (saved with torch.save) to implement finetuning with N-D parallelism using DTensor. However, I haven't been able to execute this correctly. Initially, I create the model on a meta device and apply N-D parallelism. Then I try to load the single checkpoint file using DCP, but it doesn't seem to work as expected; my approach is quite similar to the example. The key difference is that I use FSDP2 + TP to parallelize the model. Do you have any suggestions that might help achieve my goal?
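For reference, here is a minimal sketch of that flow under stated assumptions: PyTorch >= 2.6 (where `fully_shard` is exported from `torch.distributed.fsdp`), a toy `MyModel` and `tp_plan` standing in for the real model and plan, and a full (unsharded) `torch.save` checkpoint read only on rank 0 and broadcast into the DTensor shards:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.checkpoint.state_dict import (
    StateDictOptions,
    set_model_state_dict,
)
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


class MyModel(nn.Module):  # placeholder for the real model
    def __init__(self):
        super().__init__()
        self.in_proj = nn.Linear(512, 2048)
        self.out_proj = nn.Linear(2048, 512)

    def forward(self, x):
        return self.out_proj(torch.relu(self.in_proj(x)))


dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
mesh = init_device_mesh(
    "cuda", (dist.get_world_size() // 2, 2), mesh_dim_names=("dp", "tp")
)

# 1. Build on meta: no CPU RAM or VRAM is allocated for the weights.
with torch.device("meta"):
    model = MyModel()

# 2. Apply TP first, then FSDP2; params become DTensors, still on meta.
tp_plan = {"in_proj": ColwiseParallel(), "out_proj": RowwiseParallel()}
parallelize_module(model, mesh["tp"], tp_plan)
fully_shard(model, mesh=mesh["dp"])

# 3. Allocate only this rank's shards on GPU (memory is uninitialized).
model.to_empty(device="cuda")

# 4. Only rank 0 reads the full checkpoint, on CPU and memory-mapped;
#    other ranks pass an empty dict and receive tensors via broadcast.
full_sd = {}
if dist.get_rank() == 0:
    full_sd = torch.load(
        "pretrained.pt", map_location="cpu", mmap=True, weights_only=True
    )

# 5. Broadcast tensors from rank 0 and scatter them into the shards.
set_model_state_dict(
    model,
    full_sd,
    options=StateDictOptions(full_state_dict=True, broadcast_from_rank0=True),
)
```

The same `StateDictOptions` route works in reverse (`get_model_state_dict` with `full_state_dict=True`) if you later need to export a single-file checkpoint from the sharded model.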
Thanks for bringing this up! I discussed with @tianyu-l yesterday; we plan to make a simple tutorial (on top of a torchtitan fork) to let users focus more on DTensor-based parallelism. Instead of trimming the current torchtitan, we aim to provide examples of building blocks starting from scratch, e.g., Step 1: start with a model running on a single GPU -> Step 2: add FSDP on top of it -> Step 3: add more parallelism on top of Step 2 -> Step 4: add other features like meta-device initialization -> Step 5: add more and more features. A sketch of the first FSDP step is below.
Do you think this would be helpful to the community? Thanks again for your feedback!
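To illustrate the Step 1 -> Step 2 jump described above, a minimal sketch, assuming PyTorch >= 2.6 where `fully_shard` is exported from `torch.distributed.fsdp`; the model is a toy stand-in:

```python
# Start from a model that runs on a single GPU, then shard it across
# data-parallel ranks with FSDP2's fully_shard.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import fully_shard

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Step 1: a plain single-GPU model.
model = nn.Sequential(
    nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)
).cuda()

# Step 2: shard each parameterized submodule, then the root, so that
# parameters are all-gathered one layer at a time during forward/backward.
for layer in model:
    if any(p.requires_grad for p in layer.parameters()):
        fully_shard(layer)
fully_shard(model)

out = model(torch.randn(8, 1024, device="cuda"))
out.sum().backward()
```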