Multiprocessing error running optimize_parallel_gpu with pytorch + pytorch-lightning #56
Comments
Also having this one. |
I might need to update that post, but run these demos instead: https://github.com/williamFalcon/pytorch-lightning/tree/master/examples/multi_node_examples |
Could you post your code and error? |
I'm kind of looking for a way to evaluate each set of hyperparams on a separate GPU in parallel, not train a single model on multiple GPUs. I've tried this:

```python
def train_one(hparam, gpu_id_set):
    # load data, create model, create logger and checkpoint callback
    trainer = Trainer(logger=tt_logger, checkpoint_callback=checkpoint_callback,
                      gpus=[int(gpu_id_set)], max_nb_epochs=hparam.epochs,
                      weights_summary=None)
    trainer.fit(model)
    trainer.test(model)

hparams.optimize_parallel_gpu(train_one, max_nb_trials=400, gpu_ids=gpu_ids)
```

And here's the output:

```
gpu available: True, used: True
VISIBLE GPUS: 0
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/THC/THCGeneral.cpp line=54 error=3 : initialization error
Caught exception in worker thread cuda runtime error (3) : initialization error at /opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/THC/THCGeneral.cpp:54
Traceback (most recent call last):
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 37, in optimize_parallel_gpu_private
results = train_function(trial_params, gpu_id_set)
File "main.py", line 40, in train_one
trainer.fit(model)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/pytorch_lightning-0.5.1.3-py3.7.egg/pytorch_lightning/trainer/trainer.py", line 754, in fit
self.__single_gpu_train(model)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/pytorch_lightning-0.5.1.3-py3.7.egg/pytorch_lightning/trainer/trainer.py", line 793, in __singl
e_gpu_train
model.cuda(self.root_gpu)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 311, in cuda
return self._apply(lambda t: t.cuda(device))
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 208, in _apply
module._apply(fn)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 208, in _apply
module._apply(fn)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 230, in _apply
param_applied = fn(param)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 311, in <lambda>
return self._apply(lambda t: t.cuda(device))
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/torch/cuda/__init__.py", line 179, in _lazy_init
torch._C._cuda_init()
RuntimeError: cuda runtime error (3) : initialization error at /opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/THC/THCGeneral.cpp:54
```

After this there's the exact same message, but for GPU 1. After that the process seems to hang. When killed with Ctrl-C, it also outputs:

```
Traceback (most recent call last):
File "main.py", line 65, in <module>
hparams.optimize_parallel_gpu(train_one, max_nb_trials=400, gpu_ids=gpu_ids)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 323, in optimize_parallel_gpu
results = self.pool.map(optimize_parallel_gpu_private, self.trials)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/multiprocessing/pool.py", line 268, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/multiprocessing/pool.py", line 651, in get
self.wait(timeout)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/multiprocessing/pool.py", line 648, in wait
self._event.wait(timeout)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/threading.py", line 552, in wait
signaled = self._cond.wait(timeout)
File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/threading.py", line 296, in wait
waiter.acquire()
KeyboardInterrupt
```
|
Any help? I'm having a similar issue, but without multi-node (I have two GPUs in my PC).
and the entry:
|
I was able to get this working, though I can't remember all the steps exactly. My code is available here: https://github.com/jtamir/deepinpy/blob/master/main.py#L113
Things I remember being important:
|
Thank you for your quick response. I was lucky and solved this myself even sooner. You mentioned everything except one thing: in test-tube I had to remove the nested functions (used where the pool creates new processes), because pickle can't handle them. That removed the error `AttributeError: Can't pickle local object 'HyperOptArgumentParser.optimize_parallel_gpu.<locals>.init'`. After that I did all your steps and it works! Maybe I will write a blog post about that or something... Also, we should make a pull request for test-tube (I will look into it...). |
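Putting the fixes described in this thread together (a spawn start method, plus a training function defined at module level so it can be pickled), a minimal sketch of the working pattern might look like the following. The model class, hyperparameter names, and GPU id list are placeholders rather than code from this thread, and it assumes a test-tube version where the nested-function pickling issue mentioned above has been patched:

```python
import multiprocessing

from pytorch_lightning import Trainer
from test_tube import HyperOptArgumentParser


# Defined at module level (not nested inside another function) so that
# multiprocessing can pickle it when test-tube's pool spawns workers.
def train_one(hparam, gpu_id_set):
    model = MyLightningModule(hparam)  # placeholder: build your LightningModule here
    trainer = Trainer(gpus=[int(gpu_id_set)],
                      max_nb_epochs=hparam.epochs,
                      weights_summary=None)
    trainer.fit(model)
    trainer.test(model)


if __name__ == '__main__':
    # CUDA cannot be re-initialized in fork-started workers, so force the
    # 'spawn' start method before the process pool is created.
    multiprocessing.set_start_method('spawn', force=True)

    parser = HyperOptArgumentParser(strategy='random_search')
    parser.add_argument('--epochs', type=int, default=10)
    parser.opt_list('--learning_rate', type=float, default=1e-3,
                    options=[1e-2, 1e-3, 1e-4], tunable=True)
    hparams = parser.parse_args()

    # One worker per GPU id; the ids here are placeholders.
    hparams.optimize_parallel_gpu(train_one, max_nb_trials=400,
                                  gpu_ids=['0', '1'])
```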
Use torch=1.13.0 and pytorch-lightning=1.0.8. Output: |
I am following the guide to optimize hyperparameters over multiple GPUs: https://towardsdatascience.com/trivial-multi-node-training-with-pytorch-lightning-ff75dfb809bd
However, when I run the hyperparam opt, I get the following error:
Based on some reading, it seems to be an issue with initializing CUDA and multiprocessing, with the suggested change of adding `multiprocessing.set_start_method('spawn', force=True)`. Looking at `argparse_hopt.py`, I see that that specific line is commented out. When I uncomment it, I get through that error but hit a pickle error:

Looking for suggestions on what to try, thanks!
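For context, applying that suggestion from the calling script (rather than editing `argparse_hopt.py` directly) would presumably amount to something like the sketch below; the placement inside the `__main__` guard is an assumption, not something the library prescribes:

```python
import multiprocessing

if __name__ == '__main__':
    # Fork-started workers inherit an already-initialized CUDA context, which
    # is what triggers "cuda runtime error (3) : initialization error".
    # 'spawn' gives each worker a fresh interpreter and its own CUDA context.
    multiprocessing.set_start_method('spawn', force=True)
    # ... parse hparams and call optimize_parallel_gpu(...) here ...
```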