
CUDA error: out of memory #279

Open
yux326 opened this issue Apr 23, 2022 · 3 comments

yux326 commented Apr 23, 2022

I'm training HRNet with train.py on two 2080 Ti GPUs but get an "out of memory" error.
Each of the two GPUs has 11 GB of memory available.
My config file is "experiments/coco/hrnet/w32_256x192_adam_lr1e-3.yaml".
The output is:

=> creating output/coco/pose_hrnet/w32_256x192_adam_lr1e-3
=> creating log/coco/pose_hrnet/w32_256x192_adam_lr1e-3_2022-04-23-08-07
Namespace(cfg='experiments/coco/hrnet/w32_256x192_adam_lr1e-3.yaml', dataDir='', logDir='', modelDir='', opts=[], prevModelDir='')
AUTO_RESUME: True
CUDNN:
  BENCHMARK: True
  DETERMINISTIC: False
  ENABLED: True
DATASET:
  COLOR_RGB: True
  DATASET: coco
  DATA_FORMAT: jpg
  FLIP: True
  HYBRID_JOINTS_TYPE:
  NUM_JOINTS_HALF_BODY: 8
  PROB_HALF_BODY: 0.3
  ROOT: data/coco/
  ROT_FACTOR: 45
  SCALE_FACTOR: 0.35
  SELECT_DATA: False
  TEST_SET: val2017
  TRAIN_SET: train2017
DATA_DIR:
DEBUG:
  DEBUG: True
  SAVE_BATCH_IMAGES_GT: True
  SAVE_BATCH_IMAGES_PRED: True
  SAVE_HEATMAPS_GT: True
  SAVE_HEATMAPS_PRED: True
GPUS: (2, 5)
LOG_DIR: log
LOSS:
  TOPK: 8
  USE_DIFFERENT_JOINTS_WEIGHT: False
  USE_OHKM: False
  USE_TARGET_WEIGHT: True
MODEL:
  EXTRA:
    FINAL_CONV_KERNEL: 1
    PRETRAINED_LAYERS: ['conv1', 'bn1', 'conv2', 'bn2', 'layer1', 'transition1', 'stage2', 'transition2', 'stage3', 'transition3', 'stage4']
    STAGE2:
      BLOCK: BASIC
      FUSE_METHOD: SUM
      NUM_BLOCKS: [4, 4]
      NUM_BRANCHES: 2
      NUM_CHANNELS: [32, 64]
      NUM_MODULES: 1
    STAGE3:
      BLOCK: BASIC
      FUSE_METHOD: SUM
      NUM_BLOCKS: [4, 4, 4]
      NUM_BRANCHES: 3
      NUM_CHANNELS: [32, 64, 128]
      NUM_MODULES: 4
    STAGE4:
      BLOCK: BASIC
      FUSE_METHOD: SUM
      NUM_BLOCKS: [4, 4, 4, 4]
      NUM_BRANCHES: 4
      NUM_CHANNELS: [32, 64, 128, 256]
      NUM_MODULES: 3
  HEATMAP_SIZE: [48, 64]
  IMAGE_SIZE: [192, 256]
  INIT_WEIGHTS: True
  NAME: pose_hrnet
  NUM_JOINTS: 17
  PRETRAINED: models/pytorch/imagenet/hrnet_w32-36af842e.pth
  SIGMA: 2
  TAG_PER_JOINT: True
  TARGET_TYPE: gaussian
OUTPUT_DIR: output
PIN_MEMORY: True
PRINT_FREQ: 100
RANK: 0
TEST:
  BATCH_SIZE_PER_GPU: 32
  BBOX_THRE: 1.0
  COCO_BBOX_FILE: data/coco/person_detection_results/COCO_val2017_detections_AP_H_56_person.json
  FLIP_TEST: True
  IMAGE_THRE: 0.0
  IN_VIS_THRE: 0.2
  MODEL_FILE:
  NMS_THRE: 1.0
  OKS_THRE: 0.9
  POST_PROCESS: True
  SHIFT_HEATMAP: True
  SOFT_NMS: False
  USE_GT_BBOX: True
TRAIN:
  BATCH_SIZE_PER_GPU: 2
  BEGIN_EPOCH: 0
  CHECKPOINT:
  END_EPOCH: 210
  GAMMA1: 0.99
  GAMMA2: 0.0
  LR: 0.001
  LR_FACTOR: 0.1
  LR_STEP: [170, 200]
  MOMENTUM: 0.9
  NESTEROV: False
  OPTIMIZER: adam
  RESUME: False
  SHUFFLE: True
  WD: 0.0001
WORKERS: 24
=> init weights from normal distribution
Traceback (most recent call last):
  File "tools/train.py", line 223, in <module>
    main()
  File "tools/train.py", line 92, in main
    cfg, is_train=True
  File "/home/deep-high-resolution-net.pytorch/tools/../lib/models/pose_hrnet.py", line 499, in get_pose_net
    model.init_weights(cfg['MODEL']['PRETRAINED'])
  File "/home/deep-high-resolution-net.pytorch/tools/../lib/models/pose_hrnet.py", line 481, in init_weights
    pretrained_state_dict = torch.load(pretrained)
  File "/opt/conda/envs/py36torch/lib/python3.6/site-packages/torch/serialization.py", line 608, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/opt/conda/envs/py36torch/lib/python3.6/site-packages/torch/serialization.py", line 787, in _legacy_load
    result = unpickler.load()
  File "/opt/conda/envs/py36torch/lib/python3.6/site-packages/torch/serialization.py", line 743, in persistent_load
    deserialized_objects[root_key] = restore_location(obj, location)
  File "/opt/conda/envs/py36torch/lib/python3.6/site-packages/torch/serialization.py", line 175, in default_restore_location
    result = fn(storage, location)
  File "/opt/conda/envs/py36torch/lib/python3.6/site-packages/torch/serialization.py", line 155, in _cuda_deserialize
    return storage_type(obj.size())
  File "/opt/conda/envs/py36torch/lib/python3.6/site-packages/torch/cuda/__init__.py", line 606, in _lazy_new
    return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
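
Note that the traceback shows the OOM is raised inside torch.load while the pretrained checkpoint is being deserialized onto a GPU, before training starts, so batch size is not involved at this point. A common workaround in that situation (an assumption here, not a fix confirmed in this thread) is to deserialize onto the CPU first; a minimal sketch:

```python
# Minimal sketch: load the HRNet checkpoint on the CPU so that torch.load
# does not allocate CUDA memory on the default device while unpickling.
# The path is the PRETRAINED value from the config dump above; the variable
# name is illustrative, not the repository's actual loading code.
import torch

pretrained = 'models/pytorch/imagenet/hrnet_w32-36af842e.pth'

# map_location='cpu' keeps the saved tensors in host memory; the model can
# be moved to the intended GPUs afterwards (e.g. those listed in GPUS).
pretrained_state_dict = torch.load(pretrained, map_location='cpu')
```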


Masum06 commented Jul 19, 2022

@yanyixuan2000 did you find a solution?
I have the same error with CUDA version 11.7 and PyTorch CUDA 11.6.

@anas-zafar

Try using a smaller batch size, e.g. BATCH_SIZE_PER_GPU: 16 or 8.

@yanbiaogit

Lower the batch_size a bit, in the yaml file.
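
Both batch-size suggestions above amount to editing the yaml config named in the original post. A rough sketch of the relevant fields, using the key names from the config dump (values are only illustrative; note the posted dump already has TRAIN.BATCH_SIZE_PER_GPU at 2, while TEST.BATCH_SIZE_PER_GPU is 32):

```yaml
# experiments/coco/hrnet/w32_256x192_adam_lr1e-3.yaml (excerpt, illustrative values)
TEST:
  BATCH_SIZE_PER_GPU: 8   # lowered from 32 in the dump above
TRAIN:
  BATCH_SIZE_PER_GPU: 2   # already small in the dump above
```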
