
Multiprocessing error running optimize_parallel_gpu with pytorch + pytorch-lightning #56

jtamir opened this issue Oct 4, 2019 · 8 comments


@jtamir
Contributor

jtamir commented Oct 4, 2019

I am following the guide to optimize hyperparameters over multiple GPUs: https://towardsdatascience.com/trivial-multi-node-training-with-pytorch-lightning-ff75dfb809bd

However, when I run the hyperparam opt, I get the following error:

RuntimeError: cuda runtime error (3) : initialization error at /pytorch/aten/src/THC/THCGeneral.cpp:54

Based on some reading, it seems to be an issue with initializing CUDA in multiprocessing workers; the suggested fix is to add multiprocessing.set_start_method('spawn', force=True).
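
A minimal sketch of that suggested change (assuming it runs in the main module, before any CUDA work and before the worker pool is created):

import multiprocessing

if __name__ == '__main__':
    # 'spawn' starts workers with a fresh interpreter instead of forking,
    # so child processes do not inherit an already-initialized CUDA context.
    multiprocessing.set_start_method('spawn', force=True)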

Looking at argparse_hopt.py, I see that this specific line is commented out. When I uncomment it, I get past that error but hit a pickle error:

AttributeError: Can't pickle local object 'HyperOptArgumentParser.optimize_parallel_gpu.<locals>.init'
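
For context, worker processes started with 'spawn' pickle their initializer and target, and the default pickler cannot handle a function defined inside another function. A minimal sketch reproducing the same class of error with plain Python:

import pickle

def outer():
    def init():  # a "local object": defined inside another function
        pass
    return init

# Raises: AttributeError: Can't pickle local object 'outer.<locals>.init'
pickle.dumps(outer())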

Looking for suggestions on what to try, thanks!

@jtamir jtamir changed the title Multiprocessing running optimize_parallel_gpu with pytorch + pytorch-lightning Multiprocessing error running optimize_parallel_gpu with pytorch + pytorch-lightning Oct 4, 2019
@antvconst

Also having this one

@williamFalcon
Owner

I might need to update that post, but run these demos instead:

https://github.com/williamFalcon/pytorch-lightning/tree/master/examples/multi_node_examples

@williamFalcon
Owner

Could you post your code and the error?

@antvconst

antvconst commented Oct 11, 2019

I'm kind of looking for a way to evaluate each set of hyperparams on a separate GPU in parallel, not train a single model on multiple GPUs. I've tried this:

def train_one(hparam, gpu_id_set):
    # load data, create model, create logger and checkpoint callback
    trainer = Trainer(logger=tt_logger, checkpoint_callback=checkpoint_callback,
                      gpus=[int(gpu_id_set)], max_nb_epochs=hparam.epochs, weights_summary=None)
    trainer.fit(model)
    trainer.test(model)

hparams.optimize_parallel_gpu(train_one, max_nb_trials=400, gpu_ids=gpu_ids)

Here gpu_ids is ['0', '1'].

And here's the output:

gpu available: True, used: True
VISIBLE GPUS: 0
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/THC/THCGeneral.cpp line=54 error=3 : initialization error
Caught exception in worker thread cuda runtime error (3) : initialization error at /opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/THC/THCGeneral.cpp:54
Traceback (most recent call last):
  File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 37, in optimize_parallel_gpu_private
    results = train_function(trial_params, gpu_id_set)
  File "main.py", line 40, in train_one
    trainer.fit(model)
  File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/pytorch_lightning-0.5.1.3-py3.7.egg/pytorch_lightning/trainer/trainer.py", line 754, in fit
    self.__single_gpu_train(model)
  File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/pytorch_lightning-0.5.1.3-py3.7.egg/pytorch_lightning/trainer/trainer.py", line 793, in __singl
e_gpu_train
    model.cuda(self.root_gpu)
  File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 311, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 208, in _apply
    module._apply(fn)
  File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 208, in _apply
    module._apply(fn)
  File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 230, in _apply
    param_applied = fn(param)
  File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/torch/nn/modules/module.py", line 311, in <lambda>
    return self._apply(lambda t: t.cuda(device))
  File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/torch/cuda/__init__.py", line 179, in _lazy_init
    torch._C._cuda_init()
RuntimeError: cuda runtime error (3) : initialization error at /opt/conda/conda-bld/pytorch_1565272271120/work/aten/src/THC/THCGeneral.cpp:54

After this, the exact same message appears but for GPU 1, and then the process seems to hang. When killed with Ctrl-C, it also outputs:

Traceback (most recent call last):
  File "main.py", line 65, in <module>
    hparams.optimize_parallel_gpu(train_one, max_nb_trials=400, gpu_ids=gpu_ids)
  File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/site-packages/test_tube/argparse_hopt.py", line 323, in optimize_parallel_gpu
    results = self.pool.map(optimize_parallel_gpu_private, self.trials)
  File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/multiprocessing/pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/multiprocessing/pool.py", line 651, in get
    self.wait(timeout)
  File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/multiprocessing/pool.py", line 648, in wait
    self._event.wait(timeout)
  File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/threading.py", line 552, in wait
    signaled = self._cond.wait(timeout)
  File "/home/akonstantinov/.miniconda3/envs/cpn/lib/python3.7/threading.py", line 296, in wait
    waiter.acquire()
KeyboardInterrupt

@BraveDistribution

BraveDistribution commented Feb 24, 2020

Any help? I'm having a similar issue, but without multi-node (I have two GPUs in my PC).

Traceback (most recent call last):
  File "D:/Users//MFT/MFT/simulation_runner.py", line 40, in <module>
    hparams.optimize_parallel_gpu('test', gpu_ids=['0'], max_nb_trials=1)
  File "D:\Users\\anaconda3\envs\pytorch\lib\site-packages\test_tube\argparse_hopt.py", line 322, in optimize_parallel_gpu
    self.pool = Pool(processes=nb_workers, initializer=init, initargs=(gpu_q,))
  File "D:\Users\\anaconda3\envs\pytorch\lib\multiprocessing\context.py", line 119, in Pool
    context=self.get_context())
  File "D:\Users\\anaconda3\envs\pytorch\lib\multiprocessing\pool.py", line 174, in __init__
    self._repopulate_pool()
  File "D:\Users\\anaconda3\envs\pytorch\lib\multiprocessing\pool.py", line 239, in _repopulate_pool
    w.start()
  File "D:\Users\\anaconda3\envs\pytorch\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "D:\Users\\anaconda3\envs\pytorch\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "D:\Users\\anaconda3\envs\pytorch\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "D:\Users\\anaconda3\envs\pytorch\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'HyperOptArgumentParser.optimize_parallel_gpu.<locals>.init'

and the entry point:

def main(hparams):
    early_stopping = EarlyStopping('val_acc', patience=20)

    trainer = Trainer(
                         max_nb_epochs=50,
                         gpus=[0],
                         early_stop_callback=early_stopping,
                         train_percent_check=1,
                         check_val_every_n_epoch=1,
                         val_percent_check=1
                         )
    system = ParkinsonDecisionSystem(hparams)
    if hparams.evaluate:
        trainer.run_evaluation()
    else:
        trainer.fit(system)

if __name__ == '__main__':
    parent_parser = HyperOptArgumentParser(strategy='grid_search')
    parent_parser.opt_list('--augmentation', default="None", type=str, tunable=True,
                           options=["Erosion", "Gaussian", "None", "Median"])
    parent_parser.add_argument('-e', '--evaluate', dest='evaluate', action='store_true',
                               help='evaluate model on validation set')

    parent_parser.add_argument("--model_name", metavar="model_name", type=str, default=None,
                        help="Name od model from model_enum")

    hparams = parent_parser.parse_args()
    hparams.optimize_parallel_gpu(main, gpu_ids=['0'], max_nb_trials=1)

@jtamir
Contributor Author

jtamir commented Feb 24, 2020

I was able to get this working, though I can't remember exactly all the steps. My code is available here: https://github.com/jtamir/deepinpy/blob/master/main.py#L113

Things I remember being important:

  • Set num_workers=0 in your DataLoader, or Python will try to spawn multiple multiprocessing pools
  • Always pass GPU ID 0 to PyTorch Lightning's trainer, because TestTube already handles the GPU IDs: https://github.com/jtamir/deepinpy/blob/master/main.py#L46
  • Set distributed_backend=None in the trainer for similar reasons (a sketch combining these settings follows below)
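
A minimal sketch putting those three points together (make_dataset and MyModule are placeholder names; the Trainer arguments match the PyTorch Lightning version used in this thread):

from multiprocessing import set_start_method
from torch.utils.data import DataLoader
from pytorch_lightning import Trainer
from test_tube import HyperOptArgumentParser

def train_fn(hparams, gpu_id):
    # num_workers=0: avoid spawning nested multiprocessing pools inside each trial
    loader = DataLoader(make_dataset(hparams), batch_size=hparams.batch_size, num_workers=0)
    model = MyModule(hparams, loader)
    # gpus=[0] and distributed_backend=None: test-tube already assigns the visible
    # GPU for this trial, so Lightning should just use the single device it sees
    trainer = Trainer(gpus=[0], distributed_backend=None, max_nb_epochs=hparams.epochs)
    trainer.fit(model)

if __name__ == '__main__':
    set_start_method('spawn', force=True)
    parser = HyperOptArgumentParser(strategy='grid_search')
    # ... opt_list / add_argument calls as above ...
    hparams = parser.parse_args()
    hparams.optimize_parallel_gpu(train_fn, gpu_ids=['0', '1'], max_nb_trials=4)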

@BraveDistribution

@jtamir

Thank you for your quick response. I was lucky and I solved this myself even sooner.

You mentioned everything except one thing: in testtube I had to remove the nested functions (used when the pool creates new processes), because pickle can't handle them. That removed the error: AttributeError: Can't pickle local object 'HyperOptArgumentParser.optimize_parallel_gpu.<locals>.init'
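
A rough sketch of that kind of change (hypothetical names; the exact test-tube internals may differ): the pool initializer is moved to module level so the spawn pickler can import it by name instead of pickling a nested function.

from multiprocessing import Pool

g_gpu_id_q = None

def pool_init(gpu_id_q):
    # Runs once in each worker process; stash the shared GPU-id queue in a global.
    global g_gpu_id_q
    g_gpu_id_q = gpu_id_q

# Inside HyperOptArgumentParser.optimize_parallel_gpu, roughly:
#   self.pool = Pool(processes=nb_workers, initializer=pool_init, initargs=(gpu_q,))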

After that I followed all your steps and it works!

Maybe I will write a blog post about that or something... Also we should make a pull request for testtube (I will look into it...).

@yang-xidian

Using torch==1.13.0 and pytorch-lightning==1.0.8, I get this output:

  File "/home/xiaoyang/python/envs/taming/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1265, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'LightningDistributedDataParallel' object has no attribute '_sync_params'
