
CUDA error: out of memory #279

Open
yux326 opened this issue Apr 23, 2022 · 3 comments

yux326 commented Apr 23, 2022

I'm training HRNet with train.py on two 2080 Ti GPUs but get an "out of memory" error.
Each of the two GPUs has 11 GB of memory available.
My config file is "experiments/coco/hrnet/w32_256x192_adam_lr1e-3.yaml".
The output is:

=> creating output/coco/pose_hrnet/w32_256x192_adam_lr1e-3
=> creating log/coco/pose_hrnet/w32_256x192_adam_lr1e-3_2022-04-23-08-07
Namespace(cfg='experiments/coco/hrnet/w32_256x192_adam_lr1e-3.yaml', dataDir='', logDir='', modelDir='', opts=[], prevModelDir='')
AUTO_RESUME: True
CUDNN:
  BENCHMARK: True
  DETERMINISTIC: False
  ENABLED: True
DATASET:
  COLOR_RGB: True
  DATASET: coco
  DATA_FORMAT: jpg
  FLIP: True
  HYBRID_JOINTS_TYPE:
  NUM_JOINTS_HALF_BODY: 8
  PROB_HALF_BODY: 0.3
  ROOT: data/coco/
  ROT_FACTOR: 45
  SCALE_FACTOR: 0.35
  SELECT_DATA: False
  TEST_SET: val2017
  TRAIN_SET: train2017
DATA_DIR:
DEBUG:
  DEBUG: True
  SAVE_BATCH_IMAGES_GT: True
  SAVE_BATCH_IMAGES_PRED: True
  SAVE_HEATMAPS_GT: True
  SAVE_HEATMAPS_PRED: True
GPUS: (2, 5)
LOG_DIR: log
LOSS:
  TOPK: 8
  USE_DIFFERENT_JOINTS_WEIGHT: False
  USE_OHKM: False
  USE_TARGET_WEIGHT: True
MODEL:
  EXTRA:
    FINAL_CONV_KERNEL: 1
    PRETRAINED_LAYERS: ['conv1', 'bn1', 'conv2', 'bn2', 'layer1', 'transition1', 'stage2', 'transition2', 'stage3', 'transition3', 'stage4']
    STAGE2:
      BLOCK: BASIC
      FUSE_METHOD: SUM
      NUM_BLOCKS: [4, 4]
      NUM_BRANCHES: 2
      NUM_CHANNELS: [32, 64]
      NUM_MODULES: 1
    STAGE3:
      BLOCK: BASIC
      FUSE_METHOD: SUM
      NUM_BLOCKS: [4, 4, 4]
      NUM_BRANCHES: 3
      NUM_CHANNELS: [32, 64, 128]
      NUM_MODULES: 4
    STAGE4:
      BLOCK: BASIC
      FUSE_METHOD: SUM
      NUM_BLOCKS: [4, 4, 4, 4]
      NUM_BRANCHES: 4
      NUM_CHANNELS: [32, 64, 128, 256]
      NUM_MODULES: 3
  HEATMAP_SIZE: [48, 64]
  IMAGE_SIZE: [192, 256]
  INIT_WEIGHTS: True
  NAME: pose_hrnet
  NUM_JOINTS: 17
  PRETRAINED: models/pytorch/imagenet/hrnet_w32-36af842e.pth
  SIGMA: 2
  TAG_PER_JOINT: True
  TARGET_TYPE: gaussian
OUTPUT_DIR: output
PIN_MEMORY: True
PRINT_FREQ: 100
RANK: 0
TEST:
  BATCH_SIZE_PER_GPU: 32
  BBOX_THRE: 1.0
  COCO_BBOX_FILE: data/coco/person_detection_results/COCO_val2017_detections_AP_H_56_person.json
  FLIP_TEST: True
  IMAGE_THRE: 0.0
  IN_VIS_THRE: 0.2
  MODEL_FILE:
  NMS_THRE: 1.0
  OKS_THRE: 0.9
  POST_PROCESS: True
  SHIFT_HEATMAP: True
  SOFT_NMS: False
  USE_GT_BBOX: True
TRAIN:
  BATCH_SIZE_PER_GPU: 2
  BEGIN_EPOCH: 0
  CHECKPOINT:
  END_EPOCH: 210
  GAMMA1: 0.99
  GAMMA2: 0.0
  LR: 0.001
  LR_FACTOR: 0.1
  LR_STEP: [170, 200]
  MOMENTUM: 0.9
  NESTEROV: False
  OPTIMIZER: adam
  RESUME: False
  SHUFFLE: True
  WD: 0.0001
WORKERS: 24
=> init weights from normal distribution
Traceback (most recent call last):
  File "tools/train.py", line 223, in <module>
    main()
  File "tools/train.py", line 92, in main
    cfg, is_train=True
  File "/home/deep-high-resolution-net.pytorch/tools/../lib/models/pose_hrnet.py", line 499, in get_pose_net
    model.init_weights(cfg['MODEL']['PRETRAINED'])
  File "/home/deep-high-resolution-net.pytorch/tools/../lib/models/pose_hrnet.py", line 481, in init_weights
    pretrained_state_dict = torch.load(pretrained)
  File "/opt/conda/envs/py36torch/lib/python3.6/site-packages/torch/serialization.py", line 608, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/opt/conda/envs/py36torch/lib/python3.6/site-packages/torch/serialization.py", line 787, in _legacy_load
    result = unpickler.load()
  File "/opt/conda/envs/py36torch/lib/python3.6/site-packages/torch/serialization.py", line 743, in persistent_load
    deserialized_objects[root_key] = restore_location(obj, location)
  File "/opt/conda/envs/py36torch/lib/python3.6/site-packages/torch/serialization.py", line 175, in default_restore_location
    result = fn(storage, location)
  File "/opt/conda/envs/py36torch/lib/python3.6/site-packages/torch/serialization.py", line 155, in _cuda_deserialize
    return storage_type(obj.size())
  File "/opt/conda/envs/py36torch/lib/python3.6/site-packages/torch/cuda/__init__.py", line 606, in _lazy_new
    return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
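
Note that the traceback shows the OOM is raised inside torch.load while the pretrained checkpoint is being deserialized onto a GPU, before training starts, so batch size is not involved at this point. A common workaround in that situation (an assumption here, not a fix confirmed in this thread) is to deserialize onto the CPU first; a minimal sketch:

```python
# Minimal sketch: load the HRNet checkpoint on the CPU so that torch.load
# does not allocate CUDA memory on the default device while unpickling.
# The path is the PRETRAINED value from the config dump above; the variable
# name is illustrative, not the repository's actual loading code.
import torch

pretrained = 'models/pytorch/imagenet/hrnet_w32-36af842e.pth'

# map_location='cpu' keeps the saved tensors in host memory; the model can
# be moved to the intended GPUs afterwards (e.g. those listed in GPUS).
pretrained_state_dict = torch.load(pretrained, map_location='cpu')
```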


Masum06 commented Jul 19, 2022

@yanyixuan2000 did you find a solution?
I have the same error with CUDA version 11.7 and PyTorch CUDA 11.6.

@anas-zafar

Try using a smaller batch size, e.g. BATCH_SIZE_PER_GPU: 16 or 8.

@yanbiaogit

Lower the batch_size a bit, in the yaml file.
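
Both batch-size suggestions above amount to editing the yaml config named in the original post. A rough sketch of the relevant fields, using the key names from the config dump (values are only illustrative; note the posted dump already has TRAIN.BATCH_SIZE_PER_GPU at 2, while TEST.BATCH_SIZE_PER_GPU is 32):

```yaml
# experiments/coco/hrnet/w32_256x192_adam_lr1e-3.yaml (excerpt, illustrative values)
TEST:
  BATCH_SIZE_PER_GPU: 8   # lowered from 32 in the dump above
TRAIN:
  BATCH_SIZE_PER_GPU: 2   # already small in the dump above
```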
