Need a concise pipeline of best practices for loading a large pretrained model and resuming training under FSDP/TP? #1038
-
Could someone provide a concise pipeline of best practices for loading a large pretrained model (from a single file) and resuming training (from DCP-saved checkpoints) under FSDP/TP scenarios? Given that the model parameters are too large to fit on a single GPU, the main focus is on optimizing CPU memory and GPU VRAM usage while balancing loading efficiency, and on how to adapt these operations to hybrid parallel training strategies like FSDP/TP. Although TorchTitan is an excellent project, its high level of abstraction makes finding such minimal implementations time-consuming. I would appreciate a simple summary of a minimal pipeline.
-
@tangjiasheng Since TorchTitan already supports training very large models from scratch out of the box, is the gap mainly in loading the pretrained checkpoints? What's the format of the pretrained checkpoint? Is it a very large file generated by `torch.save`?
-
Yes. Like loading a pretrained checkpoint from ONE downloaded file.
-
What format are you using? Even Hugging Face doesn't assume one large file. We do have plans to add template converters for certain models, support certain formats like Hugging Face, and demonstrate how to use them. Overall, we expect users to first convert the file to DCP format using the converter. If the converter doesn't support the format, users will have to modify it. Then users can feed the converted DCP files into the default TorchTitan training flow to do post-training, as in the sketch below. Does this answer your question?
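A minimal sketch of that conversion step, assuming a checkpoint produced by `torch.save` and PyTorch's `torch.distributed.checkpoint` (DCP) APIs; the file names are placeholders and this is not torchtitan's actual converter:

```python
# Offline, single-process conversion: read the full torch.save checkpoint
# on CPU, then re-save it in DCP's sharded directory format. DCP's save()
# falls back to a single-rank mode when no process group is initialized.
import torch
import torch.distributed.checkpoint as dcp

# mmap=True avoids materializing the whole file in CPU RAM at once
# (requires a checkpoint saved with the default zipfile serialization).
state_dict = torch.load(
    "pretrained.pt", map_location="cpu", mmap=True, weights_only=True
)

dcp.save({"model": state_dict}, checkpoint_id="pretrained_dcp/")
```

After this one-time conversion, the DCP directory can be loaded directly by a sharded training run, with each rank reading only its own shards.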
-
Hey, thanks for the question. I'd like to understand your request better.
Since you said "complex parallelism strategies", you are acknowledging that this is complex, so asking for "a concise pipeline" sounds contradictory to that complexity. Are you looking for something customized for your particular use case (e.g. using 32 GPUs, model size is 8B, load from checkpoint, apply FSDP and TP only)? Everyone's use case is different, so the combinatorial nature makes it hard to have a workflow for every single case. Could you say more about your use case, and what you don't need in torchtitan, so that we can maybe help write a note/tutorial or even a trimmed version of the code for the workflow, if your case is representative?
-
Thanks @wwwjn, this is great. We should also add the checkpointing part as it is always a critical point after people initialize the parallelisms. |
-
Following our previous discussion, I'm attempting to load a single checkpoint (saved with torch.save) to implement finetuning with N-D parallelism using DTensor. However, I haven't been able to execute this correctly. Initially, I create the model on a meta device and apply N-D parallelism. Then I try to load the single checkpoint file using DCP, but it doesn't seem to work as expected; my approach is quite similar to the example. The key difference is that I use FSDP2 + TP to parallelize the model. Do you have any suggestions that might help achieve my goal?
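For reference, here is a minimal sketch of that flow under stated assumptions: PyTorch >= 2.6 (where `fully_shard` is exported from `torch.distributed.fsdp`), a toy `MyModel` and `tp_plan` standing in for the real model and plan, and a full (unsharded) `torch.save` checkpoint read only on rank 0 and broadcast into the DTensor shards:

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.checkpoint.state_dict import (
    StateDictOptions,
    set_model_state_dict,
)
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)


class MyModel(nn.Module):  # placeholder for the real model
    def __init__(self):
        super().__init__()
        self.in_proj = nn.Linear(512, 2048)
        self.out_proj = nn.Linear(2048, 512)

    def forward(self, x):
        return self.out_proj(torch.relu(self.in_proj(x)))


dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
mesh = init_device_mesh(
    "cuda", (dist.get_world_size() // 2, 2), mesh_dim_names=("dp", "tp")
)

# 1. Build on meta: no CPU RAM or VRAM is allocated for the weights.
with torch.device("meta"):
    model = MyModel()

# 2. Apply TP first, then FSDP2; params become DTensors, still on meta.
tp_plan = {"in_proj": ColwiseParallel(), "out_proj": RowwiseParallel()}
parallelize_module(model, mesh["tp"], tp_plan)
fully_shard(model, mesh=mesh["dp"])

# 3. Allocate only this rank's shards on GPU (memory is uninitialized).
model.to_empty(device="cuda")

# 4. Only rank 0 reads the full checkpoint, on CPU and memory-mapped;
#    other ranks pass an empty dict and receive tensors via broadcast.
full_sd = {}
if dist.get_rank() == 0:
    full_sd = torch.load(
        "pretrained.pt", map_location="cpu", mmap=True, weights_only=True
    )

# 5. Broadcast tensors from rank 0 and scatter them into the shards.
set_model_state_dict(
    model,
    full_sd,
    options=StateDictOptions(full_state_dict=True, broadcast_from_rank0=True),
)
```

The same `StateDictOptions` route works in reverse (`get_model_state_dict` with `full_state_dict=True`) if you later need to export a single-file checkpoint from the sharded model.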
Thanks for bringing this up! I discussed with @tianyu-l yesterday; we plan to make a simple tutorial (on top of a torchtitan fork) to let users focus more on DTensor-based parallelism. Instead of trimming the current torchtitan, we aim to provide examples of building blocks starting from scratch, e.g., Step 1: start with a model running on a single GPU -> Step 2: add FSDP on top of it -> Step 3: add more parallelism on top of Step 2 -> Step 4: add other features like meta-device initialization -> Step 5: add more and more features. A sketch of the first FSDP step is below.
Do you think this would be helpful to the community? Thanks again for your feedback!
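To illustrate the Step 1 -> Step 2 jump described above, a minimal sketch, assuming PyTorch >= 2.6 where `fully_shard` is exported from `torch.distributed.fsdp`; the model is a toy stand-in:

```python
# Start from a model that runs on a single GPU, then shard it across
# data-parallel ranks with FSDP2's fully_shard.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import fully_shard

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Step 1: a plain single-GPU model.
model = nn.Sequential(
    nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)
).cuda()

# Step 2: shard each parameterized submodule, then the root, so that
# parameters are all-gathered one layer at a time during forward/backward.
for layer in model:
    if any(p.requires_grad for p in layer.parameters()):
        fully_shard(layer)
fully_shard(model)

out = model(torch.randn(8, 1024, device="cuda"))
out.sum().backward()
```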