GPU training fails #271

Open
HaviZou opened this issue Aug 2, 2023 · 1 comment


HaviZou commented Aug 2, 2023

Training works fine on CPU.
The CUDA build is installed (2.0.0+cu118).
The log shows the GPU is enabled:
[INFO 2023-08-02 14:31:54,016 _log_device_info:1798] GPU available: True, used: True
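
A note on the version string: `2.0.0+cu118` has the shape of a PyTorch wheel version (release `2.0.0`, built against CUDA 11.8) rather than a CUDA toolkit version. That reading is an assumption, since the post does not say which package the string came from. The helper below, `parse_torch_version`, is a hypothetical name used only for illustration; it splits such a string into its two parts.

```python
import re


def parse_torch_version(version):
    """Split a string such as '2.0.0+cu118' into (release, cuda_version).

    cuda_version is None when the local label is missing or is not a
    'cuNNN' tag (for example '+cpu').
    """
    release, _, local = version.partition("+")
    # 'cu118' -> major '11', minor '8'; the last digit is taken as the minor.
    match = re.fullmatch(r"cu(\d+)(\d)", local)
    if match is None:
        return release, None
    return release, f"{match.group(1)}.{match.group(2)}"


if __name__ == "__main__":
    print(parse_torch_version("2.0.0+cu118"))
```

In a live environment, `torch.__version__` and `torch.cuda.is_available()` report the same information directly.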

PS D:\CnOCR> cnocr train -m densenet_lite_136-fc --index-dir data/images/labels_connect10 --train-config-fp docs/examples/train_config_gpu_written_num.json
[WARNING 2023-08-02 14:31:53,623 _showwarnmsg:109] D:\CnOCR\env\lib\site-packages\pytorch_lightning\core\datamodule.py:95: LightningDeprecationWarning: DataModule property train_transforms was deprecated in v1.5 and will be removed in v1.7.
rank_zero_deprecation(

[WARNING 2023-08-02 14:31:53,624 _showwarnmsg:109] D:\CnOCR\env\lib\site-packages\pytorch_lightning\core\datamodule.py:114: LightningDeprecationWarning: DataModule property val_transforms was deprecated in v1.5 and will be removed in v1.7.
rank_zero_deprecation(

[INFO 2023-08-02 14:31:54,016 _check_and_init_precision:696] Using 16bit native Automatic Mixed Precision (AMP)
[INFO 2023-08-02 14:31:54,016 _log_device_info:1798] GPU available: True, used: True
[INFO 2023-08-02 14:31:54,028 _log_device_info:1803] TPU available: False, using: 0 TPU cores
[INFO 2023-08-02 14:31:54,028 _log_device_info:1806] IPU available: False, using: 0 IPUs
[INFO 2023-08-02 14:31:54,029 _log_device_info:1809] HPU available: False, using: 0 HPUs
[INFO 2023-08-02 14:31:54,029 _determine_batch_limits:2858] Trainer(limit_train_batches=1.0) was configured so 100% of the batches per epoch will be used..
[INFO 2023-08-02 14:31:54,029 _determine_batch_limits:2858] Trainer(limit_val_batches=1.0) was configured so 100% of the batches will be used..
[INFO 2023-08-02 14:31:54,047 train:160] OcrModel(
(encoder): DenseNetLite(
(features): Sequential(
(conv0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
(norm0): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu0): ReLU(inplace=True)
(pool0): AvgPool2d(kernel_size=2, stride=2, padding=0)
(denseblock1): _DenseBlock(
(denselayer1): _DenseLayer(
(norm1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu1): ReLU(inplace=True)
(conv1): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(norm2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu2): ReLU(inplace=True)
(conv2): Conv2d(128, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
)
)
(transition1): _Transition(
(norm): BatchNorm2d(96, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv): Conv2d(96, 48, kernel_size=(1, 1), stride=(1, 1), bias=False)
(pool): AvgPool2d(kernel_size=2, stride=2, padding=0)
)
(denseblock2): _DenseBlock(
(denselayer1): _DenseLayer(
(norm1): BatchNorm2d(48, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu1): ReLU(inplace=True)
(conv1): Conv2d(48, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(norm2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu2): ReLU(inplace=True)
(conv2): Conv2d(128, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
)
(denselayer2): _DenseLayer(
(norm1): BatchNorm2d(80, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu1): ReLU(inplace=True)
(conv1): Conv2d(80, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(norm2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu2): ReLU(inplace=True)
(conv2): Conv2d(128, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
)
(denselayer3): _DenseLayer(
(norm1): BatchNorm2d(112, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu1): ReLU(inplace=True)
(conv1): Conv2d(112, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(norm2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu2): ReLU(inplace=True)
(conv2): Conv2d(128, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
)
)
(transition2): _Transition(
(norm): BatchNorm2d(144, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu): ReLU(inplace=True)
(conv): Conv2d(144, 72, kernel_size=(1, 1), stride=(1, 1), bias=False)
(pool): AvgPool2d(kernel_size=2, stride=2, padding=0)
)
(denseblock3): _DenseBlock(
(denselayer1): _DenseLayer(
(norm1): BatchNorm2d(72, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu1): ReLU(inplace=True)
(conv1): Conv2d(72, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(norm2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu2): ReLU(inplace=True)
(conv2): Conv2d(128, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
)
(denselayer2): _DenseLayer(
(norm1): BatchNorm2d(104, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu1): ReLU(inplace=True)
(conv1): Conv2d(104, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(norm2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu2): ReLU(inplace=True)
(conv2): Conv2d(128, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
)
(denselayer3): _DenseLayer(
(norm1): BatchNorm2d(136, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu1): ReLU(inplace=True)
(conv1): Conv2d(136, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(norm2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu2): ReLU(inplace=True)
(conv2): Conv2d(128, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
)
(denselayer4): _DenseLayer(
(norm1): BatchNorm2d(168, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu1): ReLU(inplace=True)
(conv1): Conv2d(168, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(norm2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu2): ReLU(inplace=True)
(conv2): Conv2d(128, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
)
(denselayer5): _DenseLayer(
(norm1): BatchNorm2d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu1): ReLU(inplace=True)
(conv1): Conv2d(200, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(norm2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu2): ReLU(inplace=True)
(conv2): Conv2d(128, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
)
(denselayer6): _DenseLayer(
(norm1): BatchNorm2d(232, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu1): ReLU(inplace=True)
(conv1): Conv2d(232, 128, kernel_size=(1, 1), stride=(1, 1), bias=False)
(norm2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(relu2): ReLU(inplace=True)
(conv2): Conv2d(128, 32, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2), bias=False)
)
)
(norm5): BatchNorm2d(264, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(pool5): AvgPool2d(kernel_size=(2, 1), stride=(2, 1), padding=0)
)
)
(decoder): Sequential(
(0): Dropout(p=0.1, inplace=False)
(1): Linear(in_features=528, out_features=128, bias=True)
(2): Dropout(p=0.1, inplace=False)
(3): Tanh()
)
(linear): Linear(in_features=128, out_features=11, bias=True)
)
[INFO 2023-08-02 14:31:54,149 set_nvidia_flags:57] LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
[INFO 2023-08-02 14:31:54,152 summarize:73]
  | Name  | Type     | Params
-----------------------------------
0 | model | OcrModel | 680 K
-----------------------------------
680 K Trainable params
0 Non-trainable params
680 K Total params
1.361 Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]Traceback (most recent call last):
File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python39_64\lib\runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python39_64\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "D:\CnOCR\env\Scripts\cnocr.exe\__main__.py", line 7, in <module>
File "D:\CnOCR\env\lib\site-packages\click\core.py", line 1130, in __call__
return self.main(*args, **kwargs)
File "D:\CnOCR\env\lib\site-packages\click\core.py", line 1055, in main
rv = self.invoke(ctx)
File "D:\CnOCR\env\lib\site-packages\click\core.py", line 1657, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "D:\CnOCR\env\lib\site-packages\click\core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "D:\CnOCR\env\lib\site-packages\click\core.py", line 760, in invoke
return __callback(*args, **kwargs)
File "D:\CnOCR\env\lib\site-packages\cnocr\cli.py", line 165, in train
trainer.fit(
File "D:\CnOCR\env\lib\site-packages\cnocr\trainer.py", line 324, in fit
self.pl_trainer.fit(pl_module, train_dataloader, val_dataloaders, datamodule)
File "D:\CnOCR\env\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 768, in fit
self._call_and_handle_interrupt(
File "D:\CnOCR\env\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 721, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "D:\CnOCR\env\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 809, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "D:\CnOCR\env\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1234, in _run
results = self._run_stage()
File "D:\CnOCR\env\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1321, in _run_stage
return self._run_train()
File "D:\CnOCR\env\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1343, in _run_train
self._run_sanity_check()
File "D:\CnOCR\env\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1411, in _run_sanity_check
val_loop.run()
File "D:\CnOCR\env\lib\site-packages\pytorch_lightning\loops\base.py", line 204, in run
self.advance(*args, **kwargs)
File "D:\CnOCR\env\lib\site-packages\pytorch_lightning\loops\dataloader\evaluation_loop.py", line 154, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File "D:\CnOCR\env\lib\site-packages\pytorch_lightning\loops\base.py", line 199, in run
self.on_run_start(*args, **kwargs)
File "D:\CnOCR\env\lib\site-packages\pytorch_lightning\loops\epoch\evaluation_epoch_loop.py", line 87, in on_run_start
self._data_fetcher = iter(data_fetcher)
File "D:\CnOCR\env\lib\site-packages\pytorch_lightning\utilities\fetching.py", line 178, in __iter__
self.dataloader_iter = iter(self.dataloader)
File "D:\CnOCR\env\lib\site-packages\torch\utils\data\dataloader.py", line 442, in __iter__
return self._get_iterator()
File "D:\CnOCR\env\lib\site-packages\torch\utils\data\dataloader.py", line 388, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "D:\CnOCR\env\lib\site-packages\torch\utils\data\dataloader.py", line 1043, in __init__
w.start()
File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python39_64\lib\multiprocessing\process.py", line 121, in start
self._popen = self._Popen(self)
File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python39_64\lib\multiprocessing\context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python39_64\lib\multiprocessing\context.py", line 327, in _Popen
return Popen(process_obj)
File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python39_64\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
reduction.dump(process_obj, to_child)
File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python39_64\lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'OcrDataModule.val_dataloader.<locals>.<lambda>'
PS D:\CnOCR> Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python39_64\lib\multiprocessing\spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python39_64\lib\multiprocessing\spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
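
For readers hitting the same `Can't pickle local object` error: the traceback passes through `popen_spawn_win32.py` and `reduction.dump`, which is the path Python takes on Windows when it starts a worker process by pickling it. A function or lambda defined inside another function cannot be pickled by the standard `pickle` module, so a `DataLoader` that holds one fails as soon as it spawns workers. The sketch below reproduces that behaviour in plain Python and shows one common workaround, `functools.partial` over a module-level function. All names here (`pad_batch`, `make_local_collate`, `make_picklable_collate`, `can_pickle`) are hypothetical and are not taken from CnOCR's source.

```python
import pickle
from functools import partial


def pad_batch(batch, pad_value):
    """Module-level helper (hypothetical): pad lists to a common length."""
    width = max(len(item) for item in batch)
    return [item + [pad_value] * (width - len(item)) for item in batch]


def make_local_collate(pad_value):
    # Failing pattern: a lambda created inside a function has "<locals>" in
    # its qualified name, so pickle cannot find it again by name.
    return lambda batch: pad_batch(batch, pad_value)


def make_picklable_collate(pad_value):
    # Workaround: a partial over a module-level function is picklable.
    return partial(pad_batch, pad_value=pad_value)


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


if __name__ == "__main__":
    print("local lambda picklable:", can_pickle(make_local_collate(0)))
    print("partial picklable:", can_pickle(make_picklable_collate(0)))
```

A PyTorch `DataLoader` created with `num_workers=0` loads data in the main process and never pickles these objects, which may be why the CPU run succeeded if its config used no workers. That explanation is a guess: the thread does not show either config file, and it does not confirm where CnOCR reads the worker count from.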

@IASNDHABHINADAD

Has this been resolved?
