aligner train torch.cuda.OutOfMemoryError #120058
Select Topic Area: Question

Body
CUDA out of memory. Tried to allocate 13.98 GiB. GPU 0 has a total capacity of 23.65 GiB of which 9.09 GiB is free. Process 783524 has 14.55 GiB memory in use. Of the allocated memory 13.98 GiB is allocated by PyTorch, and 3.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
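For reference, the allocator hint in the message above can be set before relaunching (a minimal sketch; `expandable_segments` only mitigates fragmentation and will not recover the ~14 GiB the allocation itself needs):

```bash
# From the error message: let the caching allocator use expandable segments
# instead of fixed-size blocks, reducing fragmentation of reserved memory.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```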
Looking at the training flow diagram, my understanding is that this aligner is doing full-parameter fine-tuning. I am running the Baichuan 7B model on a 4090 with 24 GB of VRAM and it will not run; GPU memory fills up. Is the only option to switch to a card with more VRAM? How much VRAM would be appropriate?
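As a rough sanity check (assuming full-parameter fine-tuning with Adam, bf16 weights and gradients, fp32 master weights and optimizer states, and no ZeRO sharding), a 7B model needs on the order of 16 bytes per parameter before counting activations:

$$
7 \times 10^{9} \times \underbrace{(2 + 2 + 4 + 4 + 4)}_{\text{bf16 weights + grads, fp32 master + Adam } m,\, v} \text{ bytes} \approx 112\ \text{GB},
$$

far beyond a single 24 GB card without sharding or offloading. The 13.98 GiB allocation in the error is roughly the size of one half-precision copy of a ~7B model's weights or gradients.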
Training parameters:
```bash
venv/bin/deepspeed "${DEEPSPEED_ARGS[@]}" \
    --module safe_rlhf.finetune \
    --train_datasets correction-json::${DATASET} \
    --model_name_or_path "${MODEL_NAME_OR_PATH}" \
    --max_length 512 \
    --trust_remote_code True \
    --epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --gradient_checkpointing \
    --learning_rate 2e-5 \
    --lr_scheduler_type cosine \
    --lr_warmup_ratio 0.03 \
    --weight_decay 0.0 \
    --seed 42 \
    --output_dir "${OUTPUT_DIR}" \
    --log_type wandb \
    --log_project Aligner-SFT \
    --zero_stage "${ZERO_STAGE}" \
    --offload "${OFFLOAD}" \
    --bf16 True \
    --tf32 True \
    --save_16bit
```
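For completeness, one common mitigation using the same flags already in this command (a sketch; it assumes the launcher forwards `--zero_stage`/`--offload` to DeepSpeed ZeRO and that `safe_rlhf.finetune` accepts these values, which should be checked against the script's `--help`):

```bash
# Illustrative values for the variables referenced in the command above:
ZERO_STAGE=3   # ZeRO stage 3: shard params, grads, and optimizer states across GPUs
OFFLOAD=all    # offload sharded state to CPU RAM, trading throughput for VRAM
```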
Replies: 1 comment

Hi @angelOnly, thanks for participating! We take our Code of Conduct very seriously and want to help ensure that everyone has a good experience free of antagonism and harassment. Unfortunately, we don’t currently have moderators for languages other than English. Until that changes, we need to ask that everyone use English here in the GitHub Community when posting. We’ll be locking any posts in languages other than English for now, including this one.