[BUG]: with Phi2 + torch_fsdp, got RecursionError: maximum recursion depth exceeded #5487

ericxsun · 2024-03-22T03:07:55Z

🐛 Describe the bug

Boosting Phi-2 with the torch_fsdp plugin, after building the model under LazyInitContext, fails with RecursionError: maximum recursion depth exceeded

script:

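import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import TorchFSDPPlugin
from colossalai.cluster import DistCoordinator
from colossalai.lazy import LazyInitContext
from colossalai.utils import get_current_device
from torch.distributed.fsdp.wrap import CustomPolicy

from modeling_phi import PhiDecoderLayer, PhiForCausalLM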

colossalai.launch_from_torch({})
coordinator = DistCoordinator()

with LazyInitContext(default_device=get_current_device()):
    model = PhiForCausalLM.from_pretrained("microsoft/phi-2")


plugin = TorchFSDPPlugin(
    use_orig_params=False,
    forward_prefetch=False,
    auto_wrap_policy=CustomPolicy(lambda m: isinstance(m, PhiDecoderLayer)),
)

booster = Booster(plugin=plugin)
model, optimizer, _, _, lr_scheduler = booster.boost(model, optimizer, lr_scheduler=lr_scheduler)
...

full stack trace:

The call that precedes the error is chunk.clone() inside the _get_shard function in torch/distributed/fsdp/flat_param.py:

        chunk, numel_to_pad = FlatParamHandle._get_unpadded_shard(
            tensor, rank, world_size
        )
        shard = chunk.clone()

Traceback

Traceback (most recent call last):
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 394, in clone
    target = LazyTensor(factory_fn, self, meta_data=self._meta_data)
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 186, in __new__
    meta_data = func(*args, **{**kwargs, "device": "meta"})
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 392, in factory_fn
    return t.clone()
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 394, in clone
    target = LazyTensor(factory_fn, self, meta_data=self._meta_data)
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 186, in __new__
    meta_data = func(*args, **{**kwargs, "device": "meta"})
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 392, in factory_fn
    return t.clone()

  ....
  
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 394, in clone
    target = LazyTensor(factory_fn, self, meta_data=self._meta_data)
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 186, in __new__
    meta_data = func(*args, **{**kwargs, "device": "meta"})
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 392, in factory_fn
    return t.clone()
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 394, in clone
    target = LazyTensor(factory_fn, self, meta_data=self._meta_data)
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 186, in __new__
    meta_data = func(*args, **{**kwargs, "device": "meta"})
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 392, in factory_fn
    return t.clone()
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 394, in clone
    target = LazyTensor(factory_fn, self, meta_data=self._meta_data)
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 184, in __new__
    with ConstructorManager.disable():
  File "/opt/conda/lib/python3.10/contextlib.py", line 281, in helper
    return _GeneratorContextManager(func, args, kwds)
  File "/opt/conda/lib/python3.10/contextlib.py", line 102, in __init__
    def __init__(self, func, args, kwds):
RecursionError: maximum recursion depth exceeded
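
The repeating frames (clone -> __new__ -> factory_fn -> clone) look like a call cycle: clone builds a factory that itself calls clone on a tensor that is still lazy, and the constructor runs that factory right away to compute metadata. Below is a minimal pure-Python sketch of that pattern. It assumes the metadata passed to the constructor is missing (None), which is only a guess from the traceback. ToyLazyTensor and clone_hits_recursion_limit are hypothetical names, not ColossalAI code.

```python
class ToyLazyTensor:
    """Hypothetical stand-in for a lazy tensor; NOT the ColossalAI LazyTensor."""

    def __init__(self, factory_fn, *args, meta_data=None):
        if meta_data is None:
            # Assumption: with no cached metadata, the factory is run eagerly.
            meta_data = factory_fn(*args)
        self.meta_data = meta_data

    def clone(self):
        def factory_fn(t):
            # t is still lazy, so this re-enters ToyLazyTensor.clone().
            return t.clone()

        return ToyLazyTensor(factory_fn, self, meta_data=self.meta_data)


def clone_hits_recursion_limit(tensor):
    """Return True if cloning the tensor ends in a RecursionError."""
    try:
        tensor.clone()
    except RecursionError:
        return True
    return False


# Metadata stays None here, so clone() keeps calling itself through factory_fn.
without_meta = ToyLazyTensor(lambda: None)
# Metadata is present here, so the constructor never runs the factory.
with_meta = ToyLazyTensor(lambda: None, meta_data="shape=(2, 2)")

print(clone_hits_recursion_limit(without_meta))
print(clone_hits_recursion_limit(with_meta))
```

In this toy version the cycle only appears when the metadata is missing. Whether the same condition holds in lazy_init.py has not been checked here.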

Environment

ColossalAI: main branch
Pytorch: 2.1.2
CUDA: 11.8
Transformers: 4.38.2

ericxsun added the bug label on Mar 22, 2024
