[BUG]: with Phi2 + torch_fsdp, got RecursionError: maximum recursion depth exceeded #5487

Open
ericxsun opened this issue Mar 22, 2024 · 0 comments
Labels
bug Something isn't working

Comments

@ericxsun
Contributor

🐛 Describe the bug

Boosting a model that was built under LazyInitContext with the torch_fsdp plugin fails with RecursionError: maximum recursion depth exceeded.

script:

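# Imports were omitted in the original report; these paths are assumed
# from the ColossalAI main branch and PyTorch 2.1.
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import TorchFSDPPlugin
from colossalai.cluster import DistCoordinator
from colossalai.lazy import LazyInitContext
from colossalai.utils import get_current_device
from torch.distributed.fsdp.wrap import CustomPolicy

from modeling_phi import PhiDecoderLayer, PhiForCausalLM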

colossalai.launch_from_torch({})
coordinator = DistCoordinator()

with LazyInitContext(default_device=get_current_device()):
    model = PhiForCausalLM.from_pretrained("microsoft/phi-2")


plugin = TorchFSDPPlugin(
    use_orig_params=False,
    forward_prefetch=False,
    auto_wrap_policy=CustomPolicy(lambda m: isinstance(m, PhiDecoderLayer)),
)

booster = Booster(plugin=plugin)  # not shown in the original report; assumed

model, optimizer, _, _, lr_scheduler = booster.boost(model, optimizer, lr_scheduler=lr_scheduler)
...

full stack trace:

The code that runs just before the RecursionError is the chunk.clone() call in the _get_shard function of torch.distributed.fsdp.flat_param.py:

        chunk, numel_to_pad = FlatParamHandle._get_unpadded_shard(
            tensor, rank, world_size
        )
        shard = chunk.clone()

Traceback

Traceback (most recent call last):
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 394, in clone
    target = LazyTensor(factory_fn, self, meta_data=self._meta_data)
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 186, in __new__
    meta_data = func(*args, **{**kwargs, "device": "meta"})
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 392, in factory_fn
    return t.clone()
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 394, in clone
    target = LazyTensor(factory_fn, self, meta_data=self._meta_data)
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 186, in __new__
    meta_data = func(*args, **{**kwargs, "device": "meta"})
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 392, in factory_fn
    return t.clone()

  ....
  
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 394, in clone
    target = LazyTensor(factory_fn, self, meta_data=self._meta_data)
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 186, in __new__
    meta_data = func(*args, **{**kwargs, "device": "meta"})
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 392, in factory_fn
    return t.clone()
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 394, in clone
    target = LazyTensor(factory_fn, self, meta_data=self._meta_data)
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 186, in __new__
    meta_data = func(*args, **{**kwargs, "device": "meta"})
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 392, in factory_fn
    return t.clone()
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 394, in clone
    target = LazyTensor(factory_fn, self, meta_data=self._meta_data)
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 184, in __new__
    with ConstructorManager.disable():
  File "/opt/conda/lib/python3.10/contextlib.py", line 281, in helper
    return _GeneratorContextManager(func, args, kwds)
  File "/opt/conda/lib/python3.10/contextlib.py", line 102, in __init__
    def __init__(self, func, args, kwds):
RecursionError: maximum recursion depth exceeded
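
The repeating frames above (clone -> __new__ -> factory_fn -> clone) suggest a cycle: clone() builds a new lazy tensor, the constructor eagerly calls the factory to compute meta data, and the factory calls clone() on an object that is still lazy. The toy below reproduces only that call shape in plain Python. ToyLazyTensor and clone_hits_recursion_limit are hypothetical names for illustration; this is not ColossalAI's LazyTensor and it does not claim to pinpoint the actual root cause.

```python
class ToyLazyTensor:
    """Hypothetical stand-in for a lazy tensor (not ColossalAI's class)."""

    def __init__(self, factory_fn, *args):
        # Like the __new__ frame in the traceback, the constructor calls the
        # factory right away to derive meta data for the new object.
        self.meta = factory_fn(*args)

    def clone(self):
        def factory_fn(t):
            # `t` is still a ToyLazyTensor here, so this re-enters clone().
            return t.clone()

        return ToyLazyTensor(factory_fn, self)


def clone_hits_recursion_limit():
    """Return True if cloning the toy tensor raises RecursionError."""
    root = ToyLazyTensor(lambda: "meta")
    try:
        root.clone()
    except RecursionError:
        return True
    return False


if __name__ == "__main__":
    print(clone_hits_recursion_limit())
```

If the same thing happens in the real code, the cycle would only terminate when the argument handed to factory_fn is a plain (for example, meta) tensor instead of another lazy wrapper. That is an assumption drawn from the traceback alone.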

Environment

  • ColossalAI: main branch
  • PyTorch: 2.1.2
  • CUDA: 11.8
  • Transformers: 4.38.2
@ericxsun ericxsun added the bug Something isn't working label Mar 22, 2024