Full Training from Start: CUDA out of memory. #35

YUANMU227 · 2024-09-19T14:31:59Z

Hello, great work! I am trying to run Full Training from Start, but I am running out of GPU memory. How much GPU memory is needed for training?

The repository states: "At least 4*A6000 GPUs or 2*A100 GPUs will be enough for the training."

I am training on 2*A100 GPUs with 80 GB each, but I still hit out-of-memory errors:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 3.82 GiB (GPU 1; 79.15 GiB total capacity; 71.88 GiB already allocated; 3.40 GiB free; 74.46 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
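To read the numbers in that message, here is a minimal sketch that parses the figures quoted above and computes how much memory PyTorch has reserved but not allocated. The helper name `parse_oom` is hypothetical and only for illustration; the values come straight from the error text. The gap comes out small relative to the allocated total, so the "reserved memory is >> allocated memory" condition does not seem to hold here, and the `max_split_size_mb` hint may not be the main fix.

```python
import re

# Figures copied from the error message in this post.
OOM_MESSAGE = (
    "Tried to allocate 3.82 GiB (GPU 1; 79.15 GiB total capacity; "
    "71.88 GiB already allocated; 3.40 GiB free; "
    "74.46 GiB reserved in total by PyTorch)"
)


def parse_oom(message):
    """Hypothetical helper: extract the GiB figures from a PyTorch OOM message."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) GiB",
        "total": r"([\d.]+) GiB total capacity",
        "allocated": r"([\d.]+) GiB already allocated",
        "free": r"([\d.]+) GiB free",
        "reserved": r"([\d.]+) GiB reserved",
    }
    return {
        name: float(re.search(pattern, message).group(1))
        for name, pattern in patterns.items()
    }


stats = parse_oom(OOM_MESSAGE)

# Memory held by PyTorch's caching allocator but not used by live tensors.
cached_unused = round(stats["reserved"] - stats["allocated"], 2)

# Ratio close to 1.0 means little room to reclaim by tuning the allocator.
reserved_ratio = stats["reserved"] / stats["allocated"]

print(f"requested:      {stats['requested']:.2f} GiB")
print(f"cached, unused: {cached_unused:.2f} GiB")
print(f"reserved/allocated: {reserved_ratio:.3f}")
```

If you still want to try the allocator hint from the message, it is set through the environment before launching training, for example `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128` (the value 128 is an arbitrary assumption, not a recommendation from the repository).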

YUANMU227 · 2024-09-19T14:33:05Z

I ran the training with the iqa_iaa.sh script.

dongdk · 2024-11-04T12:18:18Z

Is it possible to train Q-Align on a single A100-80G GPU?
