CUDA error: out of memory #279
Comments
@yanyixuan2000 did you find a solution?

Try using a smaller batch size, e.g. BATCH_SIZE_PER_GPU: 16 or 8.

Lower the batch_size in the yaml file.
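The batch-size suggestion above amounts to editing the experiment yaml; a sketch of the relevant keys (the values here are illustrative, not a recommendation for this specific GPU):

```yaml
TRAIN:
  BATCH_SIZE_PER_GPU: 8    # lower this if training runs out of GPU memory
TEST:
  BATCH_SIZE_PER_GPU: 16   # evaluation batches also occupy GPU memory
```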
I'm running train.py with two 2080 Ti GPUs but get a "CUDA out of memory" error.
Each GPU has 11 GB of memory available.
My config file is experiments/coco/hrnet/w32_256x192_adam_lr1e-3.yaml
The output is:
=> creating output/coco/pose_hrnet/w32_256x192_adam_lr1e-3
=> creating log/coco/pose_hrnet/w32_256x192_adam_lr1e-3_2022-04-23-08-07
Namespace(cfg='experiments/coco/hrnet/w32_256x192_adam_lr1e-3.yaml', dataDir='', logDir='', modelDir='', opts=[], prevModelDir='')
AUTO_RESUME: True
CUDNN:
BENCHMARK: True
DETERMINISTIC: False
ENABLED: True
DATASET:
COLOR_RGB: True
DATASET: coco
DATA_FORMAT: jpg
FLIP: True
HYBRID_JOINTS_TYPE:
NUM_JOINTS_HALF_BODY: 8
PROB_HALF_BODY: 0.3
ROOT: data/coco/
ROT_FACTOR: 45
SCALE_FACTOR: 0.35
SELECT_DATA: False
TEST_SET: val2017
TRAIN_SET: train2017
DATA_DIR:
DEBUG:
DEBUG: True
SAVE_BATCH_IMAGES_GT: True
SAVE_BATCH_IMAGES_PRED: True
SAVE_HEATMAPS_GT: True
SAVE_HEATMAPS_PRED: True
GPUS: (2, 5)
LOG_DIR: log
LOSS:
TOPK: 8
USE_DIFFERENT_JOINTS_WEIGHT: False
USE_OHKM: False
USE_TARGET_WEIGHT: True
MODEL:
EXTRA:
FINAL_CONV_KERNEL: 1
PRETRAINED_LAYERS: ['conv1', 'bn1', 'conv2', 'bn2', 'layer1', 'transition1', 'stage2', 'transition2', 'stage3', 'transition3', 'stage4']
STAGE2:
BLOCK: BASIC
FUSE_METHOD: SUM
NUM_BLOCKS: [4, 4]
NUM_BRANCHES: 2
NUM_CHANNELS: [32, 64]
NUM_MODULES: 1
STAGE3:
BLOCK: BASIC
FUSE_METHOD: SUM
NUM_BLOCKS: [4, 4, 4]
NUM_BRANCHES: 3
NUM_CHANNELS: [32, 64, 128]
NUM_MODULES: 4
STAGE4:
BLOCK: BASIC
FUSE_METHOD: SUM
NUM_BLOCKS: [4, 4, 4, 4]
NUM_BRANCHES: 4
NUM_CHANNELS: [32, 64, 128, 256]
NUM_MODULES: 3
HEATMAP_SIZE: [48, 64]
IMAGE_SIZE: [192, 256]
INIT_WEIGHTS: True
NAME: pose_hrnet
NUM_JOINTS: 17
PRETRAINED: models/pytorch/imagenet/hrnet_w32-36af842e.pth
SIGMA: 2
TAG_PER_JOINT: True
TARGET_TYPE: gaussian
OUTPUT_DIR: output
PIN_MEMORY: True
PRINT_FREQ: 100
RANK: 0
TEST:
BATCH_SIZE_PER_GPU: 32
BBOX_THRE: 1.0
COCO_BBOX_FILE: data/coco/person_detection_results/COCO_val2017_detections_AP_H_56_person.json
FLIP_TEST: True
IMAGE_THRE: 0.0
IN_VIS_THRE: 0.2
MODEL_FILE:
NMS_THRE: 1.0
OKS_THRE: 0.9
POST_PROCESS: True
SHIFT_HEATMAP: True
SOFT_NMS: False
USE_GT_BBOX: True
TRAIN:
BATCH_SIZE_PER_GPU: 2
BEGIN_EPOCH: 0
CHECKPOINT:
END_EPOCH: 210
GAMMA1: 0.99
GAMMA2: 0.0
LR: 0.001
LR_FACTOR: 0.1
LR_STEP: [170, 200]
MOMENTUM: 0.9
NESTEROV: False
OPTIMIZER: adam
RESUME: False
SHUFFLE: True
WD: 0.0001
WORKERS: 24
=> init weights from normal distribution
Traceback (most recent call last):
File "tools/train.py", line 223, in <module>
main()
File "tools/train.py", line 92, in main
cfg, is_train=True
File "/home/deep-high-resolution-net.pytorch/tools/../lib/models/pose_hrnet.py", line 499, in get_pose_net
model.init_weights(cfg['MODEL']['PRETRAINED'])
File "/home/deep-high-resolution-net.pytorch/tools/../lib/models/pose_hrnet.py", line 481, in init_weights
pretrained_state_dict = torch.load(pretrained)
File "/opt/conda/envs/py36torch/lib/python3.6/site-packages/torch/serialization.py", line 608, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/opt/conda/envs/py36torch/lib/python3.6/site-packages/torch/serialization.py", line 787, in _legacy_load
result = unpickler.load()
File "/opt/conda/envs/py36torch/lib/python3.6/site-packages/torch/serialization.py", line 743, in persistent_load
deserialized_objects[root_key] = restore_location(obj, location)
File "/opt/conda/envs/py36torch/lib/python3.6/site-packages/torch/serialization.py", line 175, in default_restore_location
result = fn(storage, location)
File "/opt/conda/envs/py36torch/lib/python3.6/site-packages/torch/serialization.py", line 155, in _cuda_deserialize
return storage_type(obj.size())
File "/opt/conda/envs/py36torch/lib/python3.6/site-packages/torch/cuda/__init__.py", line 606, in _lazy_new
return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
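The traceback shows the allocation failing inside torch.load while restoring the pretrained weights, before training even starts. A common workaround is to deserialize the checkpoint to CPU memory first via map_location; this is a minimal sketch using a stand-in checkpoint file (the path and tensor below are demo placeholders, not the real HRNet weights):

```python
import torch

# torch.load by default restores each tensor onto the CUDA device it was
# saved from, allocating GPU memory during deserialization. That is the
# step failing in the traceback above. map_location='cpu' keeps the
# checkpoint in host memory; the model can be moved to GPU afterwards.

# Stand-in checkpoint for demonstration purposes.
ckpt_path = '/tmp/demo_checkpoint.pth'
torch.save({'conv1.weight': torch.zeros(4, 3, 3, 3)}, ckpt_path)

# Load entirely onto the CPU, regardless of the saving device.
state_dict = torch.load(ckpt_path, map_location='cpu')
print(state_dict['conv1.weight'].device)  # cpu
```

In this repo that would presumably mean passing map_location='cpu' to the torch.load(pretrained) call in lib/models/pose_hrnet.py. Separately, note that the config above lists GPUS: (2, 5), while a machine with only two GPUs normally exposes device ids 0 and 1; pointing the config at nonexistent or busy device ids could also produce this allocation failure.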