aligner train torch.cuda.OutOfMemoryError #120058
Select Topic Area: Question

Body
CUDA out of memory. Tried to allocate 13.98 GiB. GPU 0 has a total capacity of 23.65 GiB of which 9.09 GiB is free. Process 783524 has 14.55 GiB memory in use. Of the allocated memory 13.98 GiB is allocated by PyTorch, and 3.49 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
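For reference, the allocator hint in the message above can be set before relaunching (a minimal sketch; `expandable_segments` only mitigates fragmentation and will not recover the ~14 GiB the allocation itself needs):

```bash
# From the error message: let the caching allocator use expandable segments
# instead of fixed-size blocks, reducing fragmentation of reserved memory.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```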
Looking at the training flow diagram, my understanding is that this aligner is doing full-parameter fine-tuning. I am running the Baichuan 7B model on a 4090 with 24 GB of VRAM and it will not run; GPU memory fills up. Is the only option to switch to a card with more VRAM? How much VRAM would be appropriate?
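As a rough sanity check (assuming full-parameter fine-tuning with Adam, bf16 weights and gradients, fp32 master weights and optimizer states, and no ZeRO sharding), a 7B model needs on the order of 16 bytes per parameter before counting activations:

$$
7 \times 10^{9} \times \underbrace{(2 + 2 + 4 + 4 + 4)}_{\text{bf16 weights + grads, fp32 master + Adam } m,\, v} \text{ bytes} \approx 112\ \text{GB},
$$

far beyond a single 24 GB card without sharding or offloading. The 13.98 GiB allocation in the error is roughly the size of one half-precision copy of a ~7B model's weights or gradients.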
Training parameters:
```bash
venv/bin/deepspeed "${DEEPSPEED_ARGS[@]}" \
    --module safe_rlhf.finetune \
    --train_datasets correction-json::${DATASET} \
    --model_name_or_path "${MODEL_NAME_OR_PATH}" \
    --max_length 512 \
    --trust_remote_code True \
    --epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --gradient_checkpointing \
    --learning_rate 2e-5 \
    --lr_scheduler_type cosine \
    --lr_warmup_ratio 0.03 \
    --weight_decay 0.0 \
    --seed 42 \
    --output_dir "${OUTPUT_DIR}" \
    --log_type wandb \
    --log_project Aligner-SFT \
    --zero_stage "${ZERO_STAGE}" \
    --offload "${OFFLOAD}" \
    --bf16 True \
    --tf32 True \
    --save_16bit
```
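For completeness, one common mitigation using the same flags already in this command (a sketch; it assumes the launcher forwards `--zero_stage`/`--offload` to DeepSpeed ZeRO and that `safe_rlhf.finetune` accepts these values, which should be checked against the script's `--help`):

```bash
# Illustrative values for the variables referenced in the command above:
ZERO_STAGE=3   # ZeRO stage 3: shard params, grads, and optimizer states across GPUs
OFFLOAD=all    # offload sharded state to CPU RAM, trading throughput for VRAM
```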
Replies: 1 comment

Hi @angelOnly, thanks for participating! We take our Code of Conduct very seriously and want to help ensure that everyone has a good experience free of antagonism and harassment. Unfortunately, we don’t currently have moderators for languages other than English. Until that changes, we need to ask that everyone use English here in the GitHub Community when posting. We’ll be locking any posts in languages other than English for now, including this one.