[rank2]: Traceback (most recent call last):
[rank2]: File "/home/lbq/Codes/x-flux/train_flux_lora_deepspeed.py", line 355, in<module>
[rank2]: main()
[rank2]: File "/home/lbq/Codes/x-flux/train_flux_lora_deepspeed.py", line 178, in main
[rank2]: dit, optimizer, _, lr_scheduler = accelerator.prepare(
[rank2]: ^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/accelerate/accelerator.py", line 1284, in prepare
[rank2]: result = self._prepare_deepspeed(*args)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/accelerate/accelerator.py", line 1751, in _prepare_deepspeed
[rank2]: engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/deepspeed/__init__.py", line 181, in initialize
[rank2]: engine = DeepSpeedEngine(args=args,
[rank2]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 262, in __init__
[rank2]: self._configure_distributed_model(model)
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 1103, in _configure_distributed_model
[rank2]: self.module.to(self.device)
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1340, in to
[rank2]: return self._apply(convert)
[rank2]: ^^^^^^^^^^^^^^^^^^^^
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank2]: module._apply(fn)
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank2]: module._apply(fn)
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/nn/modules/module.py", line 900, in _apply
[rank2]: module._apply(fn)
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/nn/modules/module.py", line 927, in _apply
[rank2]: param_applied = fn(param)
[rank2]: ^^^^^^^^^
[rank2]: File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1326, in convert
[rank2]: return t.to(
[rank2]: ^^^^^
[rank2]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 126.00 MiB. GPU 2 has a total capacity of 23.67 GiB of which 17.25 MiB is free. Including non-PyTorch memory, this process has 23.65 GiB memory in use. Of the allocated memory 23.39 GiB is allocated by PyTorch, and 13.26 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
W0120 14:21:48.179000 1575816 site-packages/torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1575915 closing signal SIGTERM
E0120 14:21:49.497000 1575816 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 1575912) of binary: /home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/bin/python
Traceback (most recent call last):
File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/bin/accelerate", line 8, in<module>sys.exit(main())
^^^^^^
File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/accelerate/commands/launch.py", line 1067, in launch_command
deepspeed_launcher(args)
File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/accelerate/commands/launch.py", line 771, in deepspeed_launcher
distrib_run.run(args)
File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/distributed/run.py", line 910, in run
elastic_launch(
File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/blwh-px/SOFTWARE/anaconda3/envs/lbq_comfyui/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_flux_lora_deepspeed.py FAILED
------------------------------------------------------------
Failures:
[1]:
time: 2025-01-20_14:21:48
host : blwhpx-ThinkStation-PX
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 1575913)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time: 2025-01-20_14:21:48
host : blwhpx-ThinkStation-PX
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 1575914)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time: 2025-01-20_14:21:48
host : blwhpx-ThinkStation-PX
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1575912)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Hi, I want to train a LoRA on 4×A5500 (24 GB) GPUs, but I run into the OOM error shown above.
Command
accelerate launch --config_file "accelerate_config.yaml" train_flux_lora_deepspeed.py --config "train_configs/test_lora-slot.yaml"
accelerate_config.yaml
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 2
  gradient_clipping: 1.0
  offload_optimizer_device: none
  offload_param_device: none
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
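For reference, here is a minimal sketch of the same deepspeed_config section with CPU offload and ZeRO stage 3 enabled, the usual levers when a model does not fit in 24 GB. This is only an assumption-laden sketch (it presumes the x-flux training script runs under stage 3), not a verified fix for the OOM above:

deepspeed_config:
  # hedged sketch, not the config used in this issue
  gradient_accumulation_steps: 2
  gradient_clipping: 1.0
  offload_optimizer_device: cpu   # keep optimizer state in host RAM
  offload_param_device: cpu       # keep partitioned parameters in host RAM
  zero3_init_flag: true           # initialize the model directly into stage-3 partitions
  zero_stage: 3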
train_configs/test_lora-slot.yaml
model_name: "flux-dev"
data_config:
  train_batch_size: 1
  num_workers: 4
  img_size: 512
  img_dir: images-slot/
  random_ratio: true # support multi crop preprocessing
report_to: wandb
train_batch_size: 1
output_dir: lora/
max_train_steps: 100000
learning_rate: 1e-5
lr_scheduler: constant
lr_warmup_steps: 10
adam_beta1: 0.9
adam_beta2: 0.999
adam_weight_decay: 0.01
adam_epsilon: 1e-8
max_grad_norm: 1.0
logging_dir: logs
mixed_precision: "bf16"
checkpointing_steps: 2500
checkpoints_total_limit: 10
tracker_project_name: lora_test
resume_from_checkpoint: latest
gradient_accumulation_steps: 2
rank: 16
single_blocks: "1,2,3,4"
double_blocks: null
#disable_sampling: false
#sample_every: 250 # sample every this many steps
#sample_width: 1024
#sample_height: 1024
#sample_steps: 20
Part of the error message is quoted at the top of this post.
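The allocator note in that traceback suggests PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. A minimal sketch of trying that hint with the same launch command follows; since it only mitigates fragmentation, it may not be enough when PyTorch already holds roughly 23.4 GiB on the card:

# hedged sketch: export the allocator hint named in the OOM message,
# then reuse the exact launch command from this issue
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
accelerate launch --config_file "accelerate_config.yaml" train_flux_lora_deepspeed.py --config "train_configs/test_lora-slot.yaml"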