Loss becomes NaN during single-GPU training; what could be the cause? #5
Please post your training log.
2022-12-15 15:19:25,166 - mmrotate - INFO - Environment info: sys.platform: linux
{"env_info": "sys.platform: linux\nPython: 3.8.13 | packaged by conda-forge | (default, Mar 25 2022, 06:04:18) [GCC 10.3.0]\nCUDA available: True\nGPU 0: NVIDIA GeForce RTX 3090\nCUDA_HOME: /usr/local/cuda-11.6\nNVCC: Cuda compilation tools, release 11.6, V11.6.55\nGCC: gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0\nPyTorch: 1.12.1\nPyTorch compiling details: PyTorch built with:\n - GCC 9.3\n - C++ Version: 201402\n - Intel(R) oneAPI Math Kernel Library Version 2021.4-Product Build 20210904 for Intel(R) 64 architecture applications\n - Intel(R) MKL-DNN v2.6.0 (Git Hash 52b5f107dd9cf10910aaa19cb47f3abf9b349815)\n - OpenMP 201511 (a.k.a. OpenMP 4.5)\n - LAPACK is enabled (usually provided by MKL)\n - NNPACK is enabled\n - CPU capability usage: AVX2\n - CUDA Runtime 11.6\n - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=compute_37\n - CuDNN 8.3.2 (built against CUDA 11.5)\n - Magma 2.6.1\n - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.6, CUDNN_VERSION=8.3.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls 
-Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.12.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=OFF, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, \n\nTorchVision: 0.13.1\nOpenCV: 4.6.0\nMMCV: 1.6.0\nMMCV Compiler: GCC 9.3\nMMCV CUDA Compiler: 11.6\nMMRotate: 0.1.0+5fe611f", "config": "dataset_type = 'DOTADataset'\ndata_root = '/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/'\nimg_norm_cfg = dict(\n mean=[123.675, 116.28, 103.53], std=[58.395, 57.12, 57.375], to_rgb=True)\ntrain_pipeline = [\n dict(type='LoadImageFromFile'),\n dict(type='LoadAnnotations', with_bbox=True),\n dict(type='RResize', img_scale=(1024, 1024)),\n dict(\n type='RRandomFlip',\n flip_ratio=[0.25, 0.25, 0.25],\n direction=['horizontal', 'vertical', 'diagonal']),\n dict(\n type='Normalize',\n mean=[123.675, 116.28, 103.53],\n std=[58.395, 57.12, 57.375],\n to_rgb=True),\n dict(type='Pad', size_divisor=32),\n dict(type='DefaultFormatBundle'),\n dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])\n]\ntest_pipeline = [\n dict(type='LoadImageFromFile'),\n dict(\n type='MultiScaleFlipAug',\n img_scale=(1024, 1024),\n flip=False,\n transforms=[\n dict(type='RResize'),\n dict(\n type='Normalize',\n mean=[123.675, 116.28, 103.53],\n std=[58.395, 57.12, 57.375],\n to_rgb=True),\n dict(type='Pad', size_divisor=32),\n dict(type='DefaultFormatBundle'),\n dict(type='Collect', keys=['img'])\n ])\n]\ndata = dict(\n samples_per_gpu=1,\n workers_per_gpu=2,\n train=dict(\n type='DOTADataset',\n ann_file=\n '/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/trainval/annfiles/',\n img_prefix=\n 
'/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/trainval/images/',\n pipeline=[\n dict(type='LoadImageFromFile'),\n dict(type='LoadAnnotations', with_bbox=True),\n dict(type='RResize', img_scale=(1024, 1024)),\n dict(\n type='RRandomFlip',\n flip_ratio=[0.25, 0.25, 0.25],\n direction=['horizontal', 'vertical', 'diagonal']),\n dict(\n type='Normalize',\n mean=[123.675, 116.28, 103.53],\n std=[58.395, 57.12, 57.375],\n to_rgb=True),\n dict(type='Pad', size_divisor=32),\n dict(type='DefaultFormatBundle'),\n dict(type='Collect', keys=['img', 'gt_bboxes', 'gt_labels'])\n ],\n version='oc'),\n val=dict(\n type='DOTADataset',\n ann_file=\n '/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/trainval/annfiles/',\n img_prefix=\n '/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/trainval/images/',\n pipeline=[\n dict(type='LoadImageFromFile'),\n dict(\n type='MultiScaleFlipAug',\n img_scale=(1024, 1024),\n flip=False,\n transforms=[\n dict(type='RResize'),\n dict(\n type='Normalize',\n mean=[123.675, 116.28, 103.53],\n std=[58.395, 57.12, 57.375],\n to_rgb=True),\n dict(type='Pad', size_divisor=32),\n dict(type='DefaultFormatBundle'),\n dict(type='Collect', keys=['img'])\n ])\n ],\n version='oc'),\n test=dict(\n type='DOTADataset',\n ann_file=\n '/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/test/test_split_1024_200/images/',\n img_prefix=\n '/media/kemove/B83CD2EA3CD2A324/CODE/mmrotate/data/DOTA1.0/test/test_split_1024_200/images/',\n pipeline=[\n dict(type='LoadImageFromFile'),\n dict(\n type='MultiScaleFlipAug',\n img_scale=(1024, 1024),\n flip=False,\n transforms=[\n dict(type='RResize'),\n dict(\n type='Normalize',\n mean=[123.675, 116.28, 103.53],\n std=[58.395, 57.12, 57.375],\n to_rgb=True),\n dict(type='Pad', size_divisor=32),\n dict(type='DefaultFormatBundle'),\n dict(type='Collect', keys=['img'])\n ])\n ],\n version='oc'))\nevaluation = dict(interval=12, metric='mAP')\noptimizer = dict(type='SGD', lr=0.0025, momentum=0.9, 
weight_decay=0.0001)\noptimizer_config = dict(grad_clip=dict(max_norm=35, norm_type=2))\nlr_config = dict(\n policy='step',\n warmup='linear',\n warmup_iters=500,\n warmup_ratio=0.3333333333333333,\n step=[8, 11])\nrunner = dict(type='EpochBasedRunner', max_epochs=12)\ncheckpoint_config = dict(interval=12)\nlog_config = dict(\n interval=50,\n hooks=[dict(type='TextLoggerHook'),\n dict(type='TensorboardLoggerHook')])\ndist_params = dict(backend='nccl')\nlog_level = 'INFO'\nload_from = None\nresume_from = None\nworkflow = [('train', 1)]\nopencv_num_threads = 0\nmp_start_method = 'fork'\nangle_version = 'oc'\nmodel = dict(\n type='KnowledgeDistillationRotatedSingleStageDetector',\n backbone=dict(\n type='ResNet',\n depth=18,\n num_stages=4,\n out_indices=(0, 1, 2, 3),\n frozen_stages=1,\n zero_init_residual=False,\n norm_cfg=dict(type='BN', requires_grad=True),\n norm_eval=True,\n style='pytorch',\n init_cfg=dict(type='Pretrained', checkpoint='torchvision://resnet18')),\n neck=dict(\n type='FPN',\n in_channels=[64, 128, 256, 512],\n out_channels=256,\n start_level=1,\n add_extra_convs='on_input',\n num_outs=5),\n bbox_head=dict(\n type='LDRotatedRetinaHead',\n num_classes=15,\n in_channels=256,\n stacked_convs=4,\n feat_channels=256,\n assign_by_circumhbbox='oc',\n anchor_generator=dict(\n type='RotatedAnchorGenerator',\n octave_base_scale=4,\n scales_per_octave=3,\n ratios=[1.0, 0.5, 2.0],\n strides=[8, 16, 32, 64, 128]),\n bbox_coder=dict(\n type='DeltaXYWHAOBBoxCoder',\n angle_range='oc',\n norm_factor=None,\n edge_swap=False,\n proj_xy=False,\n target_means=(0.0, 0.0, 0.0, 0.0, 0.0),\n target_stds=(1.0, 1.0, 1.0, 1.0, 1.0)),\n loss_cls=dict(\n type='FocalLoss',\n use_sigmoid=True,\n gamma=2.0,\n alpha=0.25,\n loss_weight=1.0),\n loss_bbox=dict(type='GDLoss', loss_weight=5.0, loss_type='gwd'),\n reg_max=8,\n reg_decoded_bbox=True,\n loss_ld=dict(type='GDLoss', loss_type='gwd', loss_weight=5.0),\n loss_kd=dict(\n type='KnowledgeDistillationKLDivLoss', 
loss_weight=30, T=5),\n loss_im=dict(type='IMLoss', loss_weight=2.0),\n imitation_method='finegrained'),\n train_cfg=dict(\n assigner=dict(\n type='MaxIoUAssigner',\n pos_iou_thr=0.5,\n neg_iou_thr=0.4,\n min_pos_iou=0,\n ignore_iof_thr=-1,\n iou_calculator=dict(type='RBboxOverlaps2D')),\n allowed_border=-1,\n pos_weight=-1,\n debug=False),\n test_cfg=dict(\n nms_pre=2000,\n min_bbox_size=0,\n score_thr=0.05,\n nms=dict(iou_thr=0.1),\n max_per_img=2000),\n teacher_config=\n './configs/gwd/rotated_retinanet_hbb_gwd_r50_fpn_1x_dota_oc.py',\n teacher_ckpt=\n '/media/kemove/B83CD2EA3CD2A324/CODE/Rotated-LD/configs/gwd/rotated_retinanet_hbb_gwd_r50_fpn_1x_dota_oc-41fd7805.pth',\n output_feature=True)\nteacher_ckpt = '/media/kemove/B83CD2EA3CD2A324/CODE/Rotated-LD/configs/gwd/rotated_retinanet_hbb_gwd_r50_fpn_1x_dota_oc-41fd7805.pth'\nwork_dir = './work_dirs/rotated_retinanet_distribution_hbb_gwd_r18_r50_fpn_1x_dota_oc'\nauto_resume = False\ngpu_ids = range(0, 1)\n", "seed": 884529645, "exp_name": "rotated_retinanet_distribution_hbb_gwd_r18_r50_fpn_1x_dota_oc.py"} |
This is usually a dataset problem, typically in the preprocessing steps or the labels. Please follow mmrotate's official dataset download and preprocessing instructions. You can also start by disabling the distillation losses: in the config file, set the KD, LD, and feature-imitation loss weights to 0 and check whether the NaN still appears.
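A minimal sketch of the suggested config change, using the `bbox_head` keys from the config posted above (`loss_ld`, `loss_kd`, `loss_im`); setting each `loss_weight` to 0 disables that distillation term while keeping the head structure intact:

```python
# Sketch: ablating the distillation terms in the bbox_head section
# (keys taken from the config posted above; only loss_weight changes).
bbox_head_overrides = dict(
    loss_ld=dict(type='GDLoss', loss_type='gwd', loss_weight=0.0),
    loss_kd=dict(type='KnowledgeDistillationKLDivLoss', loss_weight=0.0, T=5),
    loss_im=dict(type='IMLoss', loss_weight=0.0),
)
```

If training is stable with all three at 0, re-enable them one at a time to find the term that produces the NaN.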
Thanks! The dataset shouldn't be the problem: I have used it for a long time, and it trains fine in mmrotate. I will try disabling the feature-distillation loss and see what happens. I also get NaN losses with my own feature-imitation method.
After disabling the distillation losses, training no longer produces NaN. Why is that?
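Since the NaN disappears once the distillation terms are off, one way to narrow it down further is to inspect the individual loss components at each logged step and flag the first non-finite one. A small illustrative helper (the name and usage are assumptions, not code from this repo):

```python
import math

def nonfinite_losses(loss_dict):
    """Illustrative helper: return the names of scalar loss terms that are
    NaN or inf, to pinpoint which distillation term is blowing up."""
    return [name for name, value in loss_dict.items()
            if not math.isfinite(value)]

# Example: only loss_ld has gone non-finite in this (made-up) step.
bad = nonfinite_losses({'loss_cls': 0.41, 'loss_bbox': 1.2,
                        'loss_ld': float('nan')})
```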
Which teacher model are you using, and which config file?
You definitely can't use the official mmrotate weights as the teacher: their box representation is 4 plain regression values, not a probability distribution over 4n values as LD requires.
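For context, the config above sets `reg_max=8`, so an LD-compatible head predicts each of the 4 box edges as a discrete distribution over reg_max + 1 = 9 bins (4 × 9 values per anchor) rather than 4 scalar deltas. A minimal sketch, with an illustrative function name, of how such a distribution is collapsed back to a scalar edge via the expectation of its softmax (as in GFL/LD-style heads):

```python
import math

def edge_from_distribution(logits):
    """Collapse one edge's bin logits (length reg_max + 1) into a scalar
    offset: expectation of the bin index under the softmax distribution."""
    m = max(logits)                             # for numerical stability
    exps = [math.exp(x - m) for x in logits]    # unnormalized softmax
    total = sum(exps)
    return sum(i * e / total for i, e in enumerate(exps))

# A plain regression head, as in the official mmrotate checkpoints,
# emits the scalar directly, so there is no distribution to distill from.
```

With uniform logits over 9 bins the expectation is the midpoint, 4.0; a teacher trained without this distributed representation simply has nothing for the LD loss to match.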