Training threads don't start on Windows #3

Open
donamin opened this issue Sep 18, 2017 · 19 comments

@donamin

donamin commented Sep 18, 2017

Hi

I started training a few minutes ago, and this is what I got in the command prompt:

E:\agents>python -m agents.scripts.train --logdir=E:\model --config=pendulum
INFO:tensorflow:Start a new run and write summaries and checkpoints to E:\model\20170918T084053-pendulum.
WARNING:tensorflow:Number of agents should divide episodes per update.

It's been like this for about 10 minutes and TensorBoard doesn't show anything.
The log directory contains only one file, 'config.yaml'.
Is that OK? It would be nice to see whether the agent is making progress or has hung.

Thanks
Amin

@donamin
Author

donamin commented Sep 18, 2017

I changed the update_every value from 25 to 30 to resolve this warning:
Number of agents should divide episodes per update.
But it still doesn't seem to be working.

The weird thing is that sometimes when I run the code, I get the following exception:

Traceback (most recent call last):
  File "E:/agents/agents/scripts/train.py", line 165, in <module>
    tf.app.run()
  File "C:\Python\Python35\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "E:/agents/agents/scripts/train.py", line 147, in main
    for score in train(config, FLAGS.env_processes):
  File "E:/agents/agents/scripts/train.py", line 113, in train
    config.num_agents, env_processes)
  File "E:\agents\agents\scripts\utility.py", line 72, in define_batch_env
    for _ in range(num_agents)]
  File "E:\agents\agents\scripts\utility.py", line 72, in <listcomp>
    for _ in range(num_agents)]
  File "E:\agents\agents\tools\wrappers.py", line 333, in __init__
    self._process.start()
  File "C:\Python\Python35\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "C:\Python\Python35\lib\multiprocessing\context.py", line 212, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Python\Python35\lib\multiprocessing\context.py", line 313, in _Popen
    return Popen(process_obj)
  File "C:\Python\Python35\lib\multiprocessing\popen_spawn_win32.py", line 66, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Python\Python35\lib\multiprocessing\reduction.py", line 59, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'train.<locals>.<lambda>'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "E:\agents\agents\tools\wrappers.py", line 405, in close
    self._process.join()
  File "C:\Python\Python35\lib\multiprocessing\process.py", line 120, in join
    assert self._popen is not None, 'can only join a started process'
AssertionError: can only join a started process

@donamin
Author

donamin commented Sep 18, 2017

Update: When I change env_processes to False, it seems to work! But I guess that disables all the parallelism this framework provides, right?

@danijar
Contributor

danijar commented Sep 22, 2017

It can be normal for TensorBoard to not show anything for a while. The frequency for writing logs is defined inside _define_loop() in train.py. It is set to twice per epoch, where one training epoch is config.update_every * config.max_length steps and one evaluation epoch is config.eval_episodes * config.max_length steps. It could be that your environment is very slow or that an epoch consists of a large number of steps for you.
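As a rough back-of-the-envelope check (update_every = 30 is taken from your change; max_length = 200 is only an assumed value, since it is not shown in your config):

update_every = 30   # training episodes per update
max_length = 200    # assumed maximum episode length in steps
train_epoch_steps = update_every * max_length  # 30 * 200 = 6000 steps per training epoch
first_summary_after = train_epoch_steps // 2   # summaries are written twice per epoch -> ~3000 steps

So with a slow environment it can easily take a while before anything shows up in TensorBoard.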

What environment are you using and how long are episodes typically? Can you post your full config?

@donamin
Author

donamin commented Sep 22, 2017

I worked on that, and it seems there's some other problem with the code.
Now it's showing this error:

Traceback (most recent call last):
  File "E:/agents/agents/scripts/train.py", line 165, in <module>
    tf.app.run()
  File "C:\Python\Python35\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "E:/agents/agents/scripts/train.py", line 147, in main
    for score in train(config, FLAGS.env_processes):
  File "E:/agents/agents/scripts/train.py", line 113, in train
    config.num_agents, env_processes)
  File "E:\agents\agents\scripts\utility.py", line 72, in define_batch_env
    for _ in range(num_agents)]
  File "E:\agents\agents\scripts\utility.py", line 72, in <listcomp>
    for _ in range(num_agents)]
  File "E:\agents\agents\tools\wrappers.py", line 333, in __init__
    self._process.start()
  File "C:\Python\Python35\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "C:\Python\Python35\lib\multiprocessing\context.py", line 212, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Python\Python35\lib\multiprocessing\context.py", line 313, in _Popen
    return Popen(process_obj)
  File "C:\Python\Python35\lib\multiprocessing\popen_spawn_win32.py", line 66, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Python\Python35\lib\multiprocessing\reduction.py", line 59, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'train.<locals>.<lambda>'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "E:\agents\agents\tools\wrappers.py", line 405, in close
    self._process.join()
  File "C:\Python\Python35\lib\multiprocessing\process.py", line 120, in join
    assert self._popen is not None, 'can only join a started process'
AssertionError: can only join a started process

If I change env_processes to False, it works! Do you know what the problem is?

@danijar
Contributor

danijar commented Sep 22, 2017

Please wrap code blocks in three backticks. Your configuration must be picklable, and it looks like yours is not. Try to define it without using lambdas. As alternatives, use external functions, nested functions, or functools.partial(). I need to see your configuration to help further.
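For example, a minimal sketch of the functools.partial() variant (make_env here is just a hypothetical module-level factory, not code from this repository):

import functools
import pickle

def make_env(name):
  # A module-level factory can be pickled, unlike a lambda defined
  # inside another function.
  import gym
  return gym.make(name)

# constructor = lambda: make_env('Pendulum-v0')   # cannot be pickled
constructor = functools.partial(make_env, 'Pendulum-v0')
pickle.dumps(constructor)  # works, so it can be sent to a child process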

@donamin
Author

donamin commented Sep 22, 2017

OK I got an update:

In train.py, I changed this line:
batch_env = utility.define_batch_env(lambda: _create_environment(config), config.num_agents, env_processes)
into this:
batch_env = utility.define_batch_env(_create_environment(config), config.num_agents, env_processes)
Now it doesn't give me the previous error, but it seems to freeze after showing this log:

INFO:tensorflow:Start a new run and write summaries and checkpoints to E:\model\20170922-165119-pendulum.
[2017-09-22 16:51:19,149] Making new env: Pendulum-v0

The CPU load of my Python process is 0%, so it doesn't seem to be doing anything. Any ideas?

This is my config:

def default():
  """Default configuration for PPO."""
  # General
  algorithm = ppo.PPOAlgorithm
  num_agents = 10
  eval_episodes = 25
  use_gpu = False
  # Network
  network = networks.ForwardGaussianPolicy
  weight_summaries = dict(all=r'.*', policy=r'.*/policy/.*', value=r'.*/value/.*')
  policy_layers = 200, 100
  value_layers = 200, 100
  init_mean_factor = 0.05
  init_logstd = -1
  # Optimization
  update_every = 30
  policy_optimizer = 'AdamOptimizer'
  value_optimizer = 'AdamOptimizer'
  update_epochs_policy = 50
  update_epochs_value = 50
  policy_lr = 1e-4
  value_lr = 3e-4
  # Losses
  discount = 0.985
  kl_target = 1e-2
  kl_cutoff_factor = 2
  kl_cutoff_coef = 1000
  kl_init_penalty = 1
  return locals()

@danijar
Contributor

danijar commented Sep 22, 2017

Where is the env defined in your config? You should not create the environments in the main process as you did by removing the lambda.

@donamin
Author

donamin commented Sep 22, 2017

I thought we pass the env as one of the main arguments on the command line.
So how should I create the environments? Do you mean I should change the default code structure to make BatchPPO work?

@danijar
Contributor

danijar commented Sep 23, 2017

No, I meant you should undo the change you made to the batch env line. You define environments in your config by setting env = ... to either the name of a registered Gym environment or to a function that returns an env object.
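For example, a task config could look roughly like this (a sketch only; the dict style and values here are illustrative, not this repository's exact pattern):

def pendulum():
  """Sketch of a task-specific config; values are illustrative only."""
  config = default()  # start from the general defaults
  # Either the name of a registered Gym environment...
  config['env'] = 'Pendulum-v0'
  # ...or a picklable callable that returns an environment object.
  config['max_length'] = 200
  config['steps'] = 1e6
  return config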

@donamin
Author

donamin commented Sep 23, 2017

Oh, OK, I found out what I did wrong by removing the lambda keyword.
But how can I solve this using external or nested functions? I did a lot of searching but couldn't figure it out, since I'm kind of new to Python. Can you help me with this?
How is it that it works on your computer and not on mine? Not being able to pickle lambda functions seems to be a general Python limitation, and I have already tried Python 3.5 and 3.6.

@danijar
Contributor

danijar commented Sep 24, 2017

I've seen it working on many people's computers :)

Please check if YAML is installed:

python3 -c "import ruamel.yaml; print('success')"

And check if the Pendulum environment works:

python3 -c "import gym; e=gym.make('Pendulum-v0'); e.reset(); e.render(); input('success')"

If both work, please start from a fresh clone of this repository and report your error message again.

@donamin
Author

donamin commented Sep 24, 2017

Thanks for your reply.

I tried both tests with success.

I cloned the repository again and the code still doesn't work. It no longer shows the lambda error, but it stalls when it reaches this line of code in wrappers.py:
self._process.start()

When I debug, stepping into the start function eventually leads me to this line in context.py (the code hangs when it reaches this line):
from .popen_spawn_win32 import Popen

BTW, I'm using Windows 10. Maybe it has something to do with the OS?

@danijar
Contributor

danijar commented Sep 24, 2017

Yeah, that might be the problem. Multiprocessing is quite different between Windows and Linux/Mac, and we mainly tested on the latter. I'm afraid I can't be of much help since I don't use Windows. Do you have an idea how to debug this? I'd be happy to test and merge a fix if you come up with one.
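A minimal standalone illustration of the difference, independent of this repository:

import multiprocessing

def work(x):
  # Module-level functions can be pickled, so they work as process
  # targets under the 'spawn' start method that Windows uses.
  return x * 2

if __name__ == '__main__':
  # On Windows the child is a fresh interpreter and the target plus its
  # arguments are pickled and sent to it ('spawn'). Linux defaults to
  # 'fork', which copies the parent's memory instead, so lambdas and
  # locally defined functions still work there.
  ctx = multiprocessing.get_context('spawn')
  process = ctx.Process(target=work, args=(21,))
  process.start()
  process.join()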

@donamin
Author

donamin commented Sep 24, 2017

OK, thanks for your reply. I have no idea right now, but I will work on it because it's important for me to make it work on Windows. I'll let you know if it's solved.
Thanks :)

@danijar changed the title from "How to check the agent learning progress?" to "Training threads don't start on Windows" on Oct 6, 2017
@danijar
Contributor

danijar commented Nov 9, 2017

@donamin Were you able to narrow down this issue?

@donamin
Author

donamin commented Nov 9, 2017

@danijar No, I couldn't solve it, so I had to switch to Linux. Sorry.

@danijar
Contributor

danijar commented Nov 9, 2017

Thanks for getting back. I'll keep this issue open for now. We might support Windows in the future since, as far as I can see, the threading is the only platform-specific bit. But unfortunately, there are no concrete plans for this at the moment.

@erwincoumans

It seems you cannot use the _worker class method as the multiprocessing.Process target on Windows.
If you use a global function, e.g. def globalworker(constructor, conn):, it will not hang. But then it cannot use getattr.
Is there a way to rewrite _worker to be a globalworker?

self._process = multiprocessing.Process(
    target=globalworker, args=(constructor, conn))

@danijar
Contributor

danijar commented Dec 18, 2018

@erwincoumans Yes, this seems trivial since self._worker() does not access any object state. You'd just have to replace the occurrences of self with ExternalProcess. I'd be happy to accept a patch if this indeed fixes the behavior on Windows. I don't have a way to test on Windows myself.
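Something along these lines might work (a rough sketch only; the real _worker in wrappers.py dispatches more message types than shown here):

import multiprocessing

def _worker(constructor, conn):
  # Module-level worker, so the 'spawn' start method on Windows can
  # pickle the process target. The constructor must also be picklable
  # (e.g. a functools.partial over a module-level factory); conn is one
  # end of a multiprocessing.Pipe.
  env = constructor()
  while True:
    message, payload = conn.recv()
    if message == 'close':
      conn.close()
      break
    # ... dispatch 'access' and 'call' messages to env here ...

class ExternalProcess(object):

  def __init__(self, constructor):
    self._conn, conn = multiprocessing.Pipe()
    # Passing a module-level function rather than a bound method keeps
    # everything that is pickled for the child process picklable.
    self._process = multiprocessing.Process(
        target=_worker, args=(constructor, conn))
    self._process.start()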
