### 🐛 Describe the bug

While boosting the model with the `torch_fsdp` plugin inside a `LazyInitContext`, a `RecursionError` occurred: `RecursionError: maximum recursion depth exceeded`.
script:

```python
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import TorchFSDPPlugin
from colossalai.cluster import DistCoordinator
from colossalai.lazy import LazyInitContext
from colossalai.utils import get_current_device
from torch.distributed.fsdp.wrap import CustomPolicy

from modeling_phi import PhiDecoderLayer, PhiForCausalLM

colossalai.launch_from_torch({})
coordinator = DistCoordinator()

with LazyInitContext(default_device=get_current_device()):
    model = PhiForCausalLM.from_pretrained("microsoft/phi-2")

plugin = TorchFSDPPlugin(
    use_orig_params=False,
    forward_prefetch=False,
    auto_wrap_policy=CustomPolicy(lambda m: isinstance(m, PhiDecoderLayer)),
)
booster = Booster(plugin=plugin)  # booster creation was not shown in the original snippet
model, optimizer, _, _, lr_scheduler = booster.boost(model, optimizer, lr_scheduler=lr_scheduler)
...
```
full stack trace:
The code preceding the error involves the `chunk.clone()` operation within the `_get_shard` function in `torch/distributed/fsdp/flat_param.py`:
```python
chunk, numel_to_pad = FlatParamHandle._get_unpadded_shard(
    tensor, rank, world_size
)
shard = chunk.clone()
```
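For context, the sharding step above splits the flattened parameter across ranks and then `clone()`s the local chunk. The following is a torch-free sketch of that chunking logic under my own simplifying assumptions (the function name and list-based tensor stand-in are illustrative, not the PyTorch API); the relevant point is that the subsequent `clone()` is what hits the lazy-tensor path:

```python
def get_unpadded_shard_sketch(flat_param, rank, world_size):
    """Illustrative stand-in for FSDP's per-rank chunking of a flat parameter."""
    n = len(flat_param)
    size = -(-n // world_size)                      # ceil division: per-rank chunk size
    chunk = flat_param[rank * size:(rank + 1) * size]
    numel_to_pad = size - len(chunk)                # the last rank's chunk may be short
    return chunk, numel_to_pad

# 10 elements over 4 ranks: rank 3 gets the short tail and needs padding
chunk, pad = get_unpadded_shard_sketch(list(range(10)), rank=3, world_size=4)
print(chunk, pad)  # → [9] 2
```

After this step, real FSDP calls `chunk.clone()`; when `chunk` is still a ColossalAI `LazyTensor`, that call is intercepted, which is where the traceback below begins.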
```
Traceback (most recent call last):
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 394, in clone
    target = LazyTensor(factory_fn, self, meta_data=self._meta_data)
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 186, in __new__
    meta_data = func(*args, **{**kwargs, "device": "meta"})
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 392, in factory_fn
    return t.clone()
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 394, in clone
    target = LazyTensor(factory_fn, self, meta_data=self._meta_data)
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 186, in __new__
    meta_data = func(*args, **{**kwargs, "device": "meta"})
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 392, in factory_fn
    return t.clone()
  ....
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 394, in clone
    target = LazyTensor(factory_fn, self, meta_data=self._meta_data)
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 186, in __new__
    meta_data = func(*args, **{**kwargs, "device": "meta"})
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 392, in factory_fn
    return t.clone()
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 394, in clone
    target = LazyTensor(factory_fn, self, meta_data=self._meta_data)
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 186, in __new__
    meta_data = func(*args, **{**kwargs, "device": "meta"})
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 392, in factory_fn
    return t.clone()
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 394, in clone
    target = LazyTensor(factory_fn, self, meta_data=self._meta_data)
  File ".local/lib/python3.10/site-packages/colossalai/lazy/lazy_init.py", line 184, in __new__
    with ConstructorManager.disable():
  File "/opt/conda/lib/python3.10/contextlib.py", line 281, in helper
    return _GeneratorContextManager(func, args, kwds)
  File "/opt/conda/lib/python3.10/contextlib.py", line 102, in __init__
    def __init__(self, func, args, kwds):
RecursionError: maximum recursion depth exceeded
```
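The traceback cycles through `clone` → `LazyTensor.__new__` → `factory_fn` → `clone`. A hypothetical, heavily simplified sketch of that loop (the class below is my own stand-in, not ColossalAI's actual `LazyTensor`, whose real mechanics may differ):

```python
class LazyTensorSketch:
    """Toy stand-in showing the clone/__new__/factory_fn cycle from the traceback."""

    def __init__(self, factory_fn):
        # Mirrors __new__ eagerly evaluating the factory to compute meta_data:
        # constructing the lazy clone immediately runs factory_fn.
        self.meta_data = factory_fn()

    def clone(self):
        # The factory re-enters clone() on a still-lazy tensor, so construction
        # never bottoms out on a concrete tensor -- unbounded recursion.
        return LazyTensorSketch(lambda: self.clone())


def clone_recurses():
    try:
        LazyTensorSketch(lambda: None).clone()
    except RecursionError:
        return True
    return False


print(clone_recurses())  # → True
```

Under this reading, `chunk.clone()` in `_get_shard` never reaches a materialized tensor because each lazy clone's factory immediately triggers another lazy clone.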