Cannot re-initialize CUDA in forked subprocess #4

Open · moose-in-australia opened this issue Aug 31, 2019 · 1 comment

Comments

@moose-in-australia

I am trying to train the end-to-end masked transformer on the ActivityNet dataset. I am currently running this on an AWS EC2 p2.xlarge instance, which has a single GPU. I call the training script as follows:

CUDA_VISIBLE_DEVICES=0 python scripts/train.py --dist_url ./ss_model --cfgs_file cfgs/anet.yml --checkpoint_path ./checkpoint/ss_model --batch_size 14 --world_size 1 --cuda --sent_weight 0.25 --mask_weight 1.0 --gated_mask | tee log/ss_model-0

Unfortunately, I run into the multiprocessing-related error below, and so far I have been unable to debug it. When I add the 'spawn' start method as suggested by the error message, further errors occur. I would appreciate any help in figuring out what I am doing wrong.
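For context, the 'spawn' start method mentioned in the error message is normally set once at the top of the entry-point script. Below is a minimal, self-contained sketch of what that looks like in PyTorch; it is illustrative only and not code from this repository (the worker function and process count are made up):

```python
# Minimal sketch: using the 'spawn' start method so CUDA can be initialized
# safely in child processes. Illustrative only, not from this repository.
import torch
import torch.multiprocessing as mp


def worker(rank):
    # Each spawned process starts a fresh interpreter, so initializing CUDA
    # here does not conflict with the parent process.
    device = torch.device('cuda', 0)
    x = torch.ones(4, device=device)
    print(f'worker {rank}: sum = {x.sum().item()}')


if __name__ == '__main__':
    mp.set_start_method('spawn', force=True)
    procs = [mp.Process(target=worker, args=(i,)) for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

The full log from the original (fork-based) run is below: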

train.py:122: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  options_yaml = yaml.load(handle)
Namespace(alpha=0.95, attn_dropout=0.2, batch_size=14, beta=0.999, cap_dropout=0.2, cfgs_file='cfgs/anet.yml', checkpoint_path='./checkpoint/weird', cls_weight=1.0, cuda=True, d_hidden=2048, d_model=1024, dataset='anet', dataset_file='./data/anet/anet_annotations_trainval.json', densecap_references=['./data/anet/val_1.json', './data/anet/val_2.json'], dist_backend='gloo', dist_url='./weird', dur_file='./data/anet/anet_duration_frame.csv', enable_visdom=False, epsilon=1e-08, feature_root='./dataset', gated_mask=True, grad_norm=1, image_feat_size=3072, in_emb_dropout=0.1, kernel_list=[1, 2, 3, 4, 5, 7, 9, 11, 15, 21, 29, 41, 57, 71, 111, 161, 211, 251], learning_rate=0.1, load_train_samplelist=False, load_valid_samplelist=False, loss_alpha_r=2, losses_log_every=1, mask_weight=1.0, max_epochs=20, max_sentence_len=20, n_heads=8, n_layers=2, neg_thresh=0.3, num_workers=1, optim='sgd', patience_epoch=1, pos_thresh=0.7, reduce_factor=0.5, reg_weight=10, sample_prob=0, sampling_sec=0.5, save_checkpoint_every=1, save_train_samplelist=False, save_valid_samplelist=False, scst_weight=0.0, seed=213, sent_weight=0.25, slide_window_size=480, slide_window_stride=20, start_from='', stride_factor=50, train_data_folder=['training'], train_sample=20, train_samplelist_path='/z/home/luozhou/subsystem/densecap_vid/train_samplelist.pkl', val_data_folder=['validation'], valid_batch_size=64, valid_samplelist_path='/z/home/luozhou/subsystem/densecap_vid/valid_samplelist.pkl', vis_emb_dropout=0.1, world_size=1)
loading dataset
# of words in the vocab: 4563
# of sentences in training: 37421, # of sentences in validation: 17505
# of training videos: 10009
size of the sentence block variable (['training']): torch.Size([37415, 20])
Process ForkPoolWorker-1:
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/pool.py", line 108, in worker
    task = get()
  File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/queues.py", line 337, in get
    return _ForkingPickler.loads(res)
  File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 95, in rebuild_storage_cuda
    torch.cuda._lazy_init()
  File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/site-packages/torch/cuda/__init__.py", line 159, in _lazy_init
    "Cannot re-initialize CUDA in forked subprocess. " + msg)
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
Process ForkPoolWorker-2:
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/pool.py", line 108, in worker
    task = get()
  File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/queues.py", line 337, in get
    return _ForkingPickler.loads(res)
  File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 95, in rebuild_storage_cuda
    torch.cuda._lazy_init()
  File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/site-packages/torch/cuda/__init__.py", line 159, in _lazy_init
    "Cannot re-initialize CUDA in forked subprocess. " + msg)
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
Process ForkPoolWorker-3:
Process ForkPoolWorker-4:
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/pool.py", line 108, in worker
    task = get()
  File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/queues.py", line 337, in get
    return _ForkingPickler.loads(res)
  File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 95, in rebuild_storage_cuda
    torch.cuda._lazy_init()
  File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/site-packages/torch/cuda/__init__.py", line 159, in _lazy_init
    "Cannot re-initialize CUDA in forked subprocess. " + msg)
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/pool.py", line 108, in worker
    task = get()
  File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/queues.py", line 337, in get
    return _ForkingPickler.loads(res)
  File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 95, in rebuild_storage_cuda
    torch.cuda._lazy_init()
  File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/site-packages/torch/cuda/__init__.py", line 159, in _lazy_init
    "Cannot re-initialize CUDA in forked subprocess. " + msg)
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1525909934016/work/aten/src/THC/generic/THCStorage.c line=150 error=3 : initialization error
Process ForkPoolWorker-5:
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/pool.py", line 108, in worker
    task = get()
  File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/multiprocessing/queues.py", line 337, in get
    return _ForkingPickler.loads(res)
  File "/home/ubuntu/miniconda3/envs/demo_ss2/lib/python3.6/site-packages/torch/multiprocessing/reductions.py", line 94, in rebuild_storage_cuda
    return storage._new_view(offset, view_size)
RuntimeError: cuda runtime error (3) : initialization error at /opt/conda/conda-bld/pytorch_1525909934016/work/aten/src/THC/generic/THCStorage.c:150
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1525909934016/work/aten/src/THC/generic/THCStorage.c line=150 error=3 : initialization error
@LuoweiZhou (Owner)

@moose-in-australia you may want to refer to this issue: salesforce#11
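For anyone else hitting the same error: the usual workarounds in PyTorch are either to use the 'spawn' start method (as sketched above) or to make sure DataLoader workers only ever handle CPU tensors, moving each batch to the GPU in the main process. A minimal sketch of the latter, with made-up dataset and field names (not code from this repository):

```python
# Minimal sketch of the CPU-only-workers workaround: the Dataset returns CPU
# tensors, and the batch is moved to the GPU in the parent process, so forked
# DataLoader workers never initialize CUDA. Names here are illustrative.
import torch
from torch.utils.data import DataLoader, Dataset


class CpuOnlyDataset(Dataset):
    def __init__(self, features):
        # Keep the underlying features on the CPU so forked workers never
        # have to touch CUDA.
        self.features = features.cpu()

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx]


if __name__ == '__main__':
    loader = DataLoader(CpuOnlyDataset(torch.randn(100, 16)),
                        batch_size=14, num_workers=1)
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    for batch in loader:
        batch = batch.to(device)  # GPU transfer happens in the parent process
        # ... forward/backward pass would go here ...
```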
