OOM while training lora. #149

Open
lbq779660843 opened this issue Jan 20, 2025 · 0 comments
lbq779660843 commented Jan 20, 2025

Hi, I want to train a LoRA on 4× A5500 (24 GB) GPUs but I am running into an OOM error.

Command
accelerate launch --config_file "accelerate_config.yaml" train_flux_lora_deepspeed.py --config "train_configs/test_lora-slot.yaml"

accelerate_config.yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 2
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
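
For reference, the traceback below fails inside self.module.to(self.device), i.e. while DeepSpeed copies a full model onto every rank before training even starts. ZeRO stage 2 shards only optimizer states and gradients, not parameters, and flux-dev's transformer is about 12B parameters (~24 GB in bf16), which already fills a 24 GB A5500 on its own. One variant of the deepspeed_config section that might get around this is ZeRO stage 3 with CPU offload (untested sketch; I have not verified that the x-flux training script runs under stage 3, and every other key stays as above):

deepspeed_config:
  gradient_accumulation_steps: 2
  gradient_clipping: 1.0
  offload_optimizer_device: cpu   # keep optimizer states in CPU RAM
  offload_param_device: cpu       # keep the parameter shards in CPU RAM (only valid with stage 3)
  zero3_init_flag: false
  zero_stage: 3                   # shard parameters too, not only optimizer states and gradients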

train_configs/test_lora-slot.yaml
model_name: "flux-dev"
data_config:
  train_batch_size: 1
  num_workers: 4
  img_size: 512
  img_dir: images-slot/
  random_ratio: true # support multi crop preprocessing
report_to: wandb
train_batch_size: 1
output_dir: lora/
max_train_steps: 100000
learning_rate: 1e-5
lr_scheduler: constant
lr_warmup_steps: 10
adam_beta1: 0.9
adam_beta2: 0.999
adam_weight_decay: 0.01
adam_epsilon: 1e-8
max_grad_norm: 1.0
logging_dir: logs
mixed_precision: "bf16"
checkpointing_steps: 2500
checkpoints_total_limit: 10
tracker_project_name: lora_test
resume_from_checkpoint: latest
gradient_accumulation_steps: 2
rank: 16
single_blocks: "1,2,3,4"
double_blocks: null
#disable_sampling: false
#sample_every: 250 # sample every this many steps
#sample_width: 1024
#sample_height: 1024
#sample_steps: 20

Part of the error message:

 [rank2]: Traceback (most recent call last):
 [rank2]:   File "/home/lbq/Codes/x-flux/train_flux_lora_deepspeed.py", line 355, in <module>
 [rank2]:     main()
 [rank2]:   File "/home/lbq/Codes/x-flux/train_flux_lora_deepspeed.py", line 178, in main
 [rank2]:     dit, optimizer, _, lr_scheduler = accelerator.prepare(
 [rank2]:                                       ^^^^^^^^^^^^^^^^^^^^
 [rank2]:   File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/accelerate/accelerator.py", line 1284, in prepare
 [rank2]:     result = self._prepare_deepspeed(*args)
 [rank2]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 [rank2]:   File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/accelerate/accelerator.py", line 1751, in _prepare_deepspeed
 [rank2]:     engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
 [rank2]:                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 [rank2]:   File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/deepspeed/__init__.py", line 181, in initialize
 [rank2]:     engine = DeepSpeedEngine(args=args,
 [rank2]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^
 [rank2]:   File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 262, in __init__
 [rank2]:     self._configure_distributed_model(model)
 [rank2]:   File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 1103, in _configure_distributed_model
 [rank2]:     self.module.to(self.device)
 [rank2]:   File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1340, in to
 [rank2]:     return self._apply(convert)
 [rank2]:            ^^^^^^^^^^^^^^^^^^^^
 [rank2]:   File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/nn/modules/module.py", line 900, in _apply
 [rank2]:     module._apply(fn)
 [rank2]:   File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/nn/modules/module.py", line 900, in _apply
 [rank2]:     module._apply(fn)
 [rank2]:   File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/nn/modules/module.py", line 900, in _apply
 [rank2]:     module._apply(fn)
 [rank2]:   File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/nn/modules/module.py", line 927, in _apply
 [rank2]:     param_applied = fn(param)
 [rank2]:                     ^^^^^^^^^
 [rank2]:   File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1326, in convert
 [rank2]:     return t.to(
 [rank2]:            ^^^^^
 [rank2]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 126.00 MiB. GPU 2 has a total capacity of 23.67 GiB of which 17.25 MiB is free. Including non-PyTorch memory, this process has 23.65 GiB memory in use. Of the allocated memory 23.39 GiB is allocated by PyTorch, and 13.26 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
 W0120 14:21:48.179000 1575816 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1575915 closing signal SIGTERM
 E0120 14:21:49.497000 1575816 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 1575912) of binary: /home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/bin/python
 Traceback (most recent call last):
   File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/bin/accelerate", line 8, in <module>
     sys.exit(main())
              ^^^^^^
   File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
     args.func(args)
   File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1067, in launch_command
     deepspeed_launcher(args)
   File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/accelerate/commands/launch.py", line 771, in deepspeed_launcher
     distrib_run.run(args)
   File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/distributed/run.py", line 910, in run
     elastic_launch(
   File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
     return launch_agent(self._config, self._entrypoint, list(args))
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
     raise ChildFailedError(
 torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
 ============================================================
 train_flux_lora_deepspeed.py FAILED
 ------------------------------------------------------------
 Failures:
 [1]:
   time      : 2025-01-20_14:21:48
   host      : blwhpx-ThinkStation-PX
   rank      : 1 (local_rank: 1)
   exitcode  : 1 (pid: 1575913)
   error_file: <N/A>
   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
 [2]:
   time      : 2025-01-20_14:21:48
   host      : blwhpx-ThinkStation-PX
   rank      : 2 (local_rank: 2)
   exitcode  : 1 (pid: 1575914)
   error_file: <N/A>
   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
 ------------------------------------------------------------
 Root Cause (first observed failure):
 [0]:
   time      : 2025-01-20_14:21:48
   host      : blwhpx-ThinkStation-PX
   rank      : 0 (local_rank: 0)
   exitcode  : 1 (pid: 1575912)
   error_file: <N/A>
   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
 ============================================================
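
The error message itself suggests setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. With only ~17 MiB free at the point of failure it is unlikely to be enough on its own (the model simply does not fit), but it is a quick thing to try before touching the DeepSpeed setup:

# As suggested by the allocator message above; mitigates fragmentation but does not shrink the model.
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
accelerate launch --config_file "accelerate_config.yaml" train_flux_lora_deepspeed.py --config "train_configs/test_lora-slot.yaml"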