Training threads don't start on Windows #3

Open
donamin opened this issue Sep 18, 2017 · 19 comments

@donamin

donamin commented Sep 18, 2017

Hi

I started training a few minutes ago, and this is what I got in the command prompt:

E:\agents>python -m agents.scripts.train --logdir=E:\model --config=pendulum
INFO:tensorflow:Start a new run and write summaries and checkpoints to E:\model\20170918T084053-pendulum.
WARNING:tensorflow:Number of agents should divide episodes per update.

It's been like this for about 10 minutes and TensorBoard doesn't show anything.
The log directory contains only one file, 'config.yaml'.
Is that OK? It would be nice to see whether the agent is making progress or has hung.

Thanks
Amin

@donamin
Author

donamin commented Sep 18, 2017

I changed the update_every value from 25 to 30 to resolve this warning:
Number of agents should divide episodes per update.
But it still doesn't seem to be working.

The weird thing is that sometimes when I run the code, I get the following exception:

Traceback (most recent call last):
  File "E:/agents/agents/scripts/train.py", line 165, in <module>
    tf.app.run()
  File "C:\Python\Python35\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "E:/agents/agents/scripts/train.py", line 147, in main
    for score in train(config, FLAGS.env_processes):
  File "E:/agents/agents/scripts/train.py", line 113, in train
    config.num_agents, env_processes)
  File "E:\agents\agents\scripts\utility.py", line 72, in define_batch_env
    for _ in range(num_agents)]
  File "E:\agents\agents\scripts\utility.py", line 72, in <listcomp>
    for _ in range(num_agents)]
  File "E:\agents\agents\tools\wrappers.py", line 333, in __init__
    self._process.start()
  File "C:\Python\Python35\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "C:\Python\Python35\lib\multiprocessing\context.py", line 212, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Python\Python35\lib\multiprocessing\context.py", line 313, in _Popen
    return Popen(process_obj)
  File "C:\Python\Python35\lib\multiprocessing\popen_spawn_win32.py", line 66, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Python\Python35\lib\multiprocessing\reduction.py", line 59, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'train.<locals>.<lambda>'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "E:\agents\agents\tools\wrappers.py", line 405, in close
    self._process.join()
  File "C:\Python\Python35\lib\multiprocessing\process.py", line 120, in join
    assert self._popen is not None, 'can only join a started process'
AssertionError: can only join a started process

@donamin
Author

donamin commented Sep 18, 2017

Update: When I change env_processes to False, it seems to work! But I guess that disables all the parallelism this framework provides, right?

@danijar
Contributor

danijar commented Sep 22, 2017

It can be normal for TensorBoard to not show anything for a while. The frequency for writing logs is defined inside _define_loop() in train.py. It is set to twice per epoch, where one training epoch is config.update_every * config.max_length steps and one evaluation epoch is config.eval_episodes * config.max_length steps. It could be that your environment is very slow or that an epoch consists of a large number of steps for you.
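As a rough back-of-the-envelope check (update_every = 30 is taken from your change; max_length = 200 is only an assumed value, since it is not shown in your config):

update_every = 30   # training episodes per update
max_length = 200    # assumed maximum episode length in steps
train_epoch_steps = update_every * max_length  # 30 * 200 = 6000 steps per training epoch
first_summary_after = train_epoch_steps // 2   # summaries are written twice per epoch -> ~3000 steps

So with a slow environment it can easily take a while before anything shows up in TensorBoard.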

What environment are you using and how long are episodes typically? Can you post your full config?

@donamin
Author

donamin commented Sep 22, 2017

I worked on that, and it seems there's some other problem with the code.
Now it's showing this error:

Traceback (most recent call last):
  File "E:/agents/agents/scripts/train.py", line 165, in <module>
    tf.app.run()
  File "C:\Python\Python35\lib\site-packages\tensorflow\python\platform\app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "E:/agents/agents/scripts/train.py", line 147, in main
    for score in train(config, FLAGS.env_processes):
  File "E:/agents/agents/scripts/train.py", line 113, in train
    config.num_agents, env_processes)
  File "E:\agents\agents\scripts\utility.py", line 72, in define_batch_env
    for _ in range(num_agents)]
  File "E:\agents\agents\scripts\utility.py", line 72, in <listcomp>
    for _ in range(num_agents)]
  File "E:\agents\agents\tools\wrappers.py", line 333, in __init__
    self._process.start()
  File "C:\Python\Python35\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "C:\Python\Python35\lib\multiprocessing\context.py", line 212, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Python\Python35\lib\multiprocessing\context.py", line 313, in _Popen
    return Popen(process_obj)
  File "C:\Python\Python35\lib\multiprocessing\popen_spawn_win32.py", line 66, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Python\Python35\lib\multiprocessing\reduction.py", line 59, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'train.<locals>.<lambda>'
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "E:\agents\agents\tools\wrappers.py", line 405, in close
    self._process.join()
  File "C:\Python\Python35\lib\multiprocessing\process.py", line 120, in join
    assert self._popen is not None, 'can only join a started process'
AssertionError: can only join a started process

If I change env_processes to False, it works! Do you know what the problem is?

@danijar
Contributor

danijar commented Sep 22, 2017

Please wrap code blocks in three backticks. Your configuration must be picklable, and it looks like yours is not. Try to define it without using lambdas. As alternatives, use external functions, nested functions, or functools.partial(). I need to see your configuration to help further.
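For example, a minimal sketch of the functools.partial() variant (make_env here is just a hypothetical module-level factory, not code from this repository):

import functools
import pickle

def make_env(name):
  # A module-level factory can be pickled, unlike a lambda defined
  # inside another function.
  import gym
  return gym.make(name)

# constructor = lambda: make_env('Pendulum-v0')   # cannot be pickled
constructor = functools.partial(make_env, 'Pendulum-v0')
pickle.dumps(constructor)  # works, so it can be sent to a child process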

@donamin
Author

donamin commented Sep 22, 2017

OK I got an update:

In train.py, I changed this line:
batch_env = utility.define_batch_env(lambda: _create_environment(config), config.num_agents, env_processes)
into this:
batch_env = utility.define_batch_env(_create_environment(config), config.num_agents, env_processes)
Now it doesn't give me the previous error, but it seems to freeze after showing this log:

INFO:tensorflow:Start a new run and write summaries and checkpoints to E:\model\20170922-165119-pendulum.
[2017-09-22 16:51:19,149] Making new env: Pendulum-v0

The CPU load of my Python process is 0%, so it doesn't seem to be doing anything. Any ideas?

This is my config:

def default():
  """Default configuration for PPO."""
  # General
  algorithm = ppo.PPOAlgorithm
  num_agents = 10
  eval_episodes = 25
  use_gpu = False
  # Network
  network = networks.ForwardGaussianPolicy
  weight_summaries = dict(all=r'.*', policy=r'.*/policy/.*', value=r'.*/value/.*')
  policy_layers = 200, 100
  value_layers = 200, 100
  init_mean_factor = 0.05
  init_logstd = -1
  # Optimization
  update_every = 30
  policy_optimizer = 'AdamOptimizer'
  value_optimizer = 'AdamOptimizer'
  update_epochs_policy = 50
  update_epochs_value = 50
  policy_lr = 1e-4
  value_lr = 3e-4
  # Losses
  discount = 0.985
  kl_target = 1e-2
  kl_cutoff_factor = 2
  kl_cutoff_coef = 1000
  kl_init_penalty = 1
  return locals()

@danijar
Contributor

danijar commented Sep 22, 2017

Where is the env defined in your config? You should not create the environments in the main process as you did by removing the lambda.

@donamin
Author

donamin commented Sep 22, 2017

I thought we pass the env as one of the main arguments on the command line.
So how should I create the environments? Do you mean I should change the default code structure to make BatchPPO work?

@danijar
Contributor

danijar commented Sep 23, 2017

No, I meant you should undo the change you made to the batch env line. You define environments in your config by setting env = ... to either the name of a registered Gym environment or to a function that returns an env object.
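For example, a task config could look roughly like this (a sketch only; the dict style and values here are illustrative, not this repository's exact pattern):

def pendulum():
  """Sketch of a task-specific config; values are illustrative only."""
  config = default()  # start from the general defaults
  # Either the name of a registered Gym environment...
  config['env'] = 'Pendulum-v0'
  # ...or a picklable callable that returns an environment object.
  config['max_length'] = 200
  config['steps'] = 1e6
  return config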

@donamin
Author

donamin commented Sep 23, 2017

Oh, OK, I found out what I did wrong by removing the lambda keyword.
But how can I solve this using external or nested functions? I did a lot of searching but couldn't figure it out, since I'm kind of new to Python. Can you help me with this?
How is it that it works on your computer and not on mine? Not being able to pickle lambda functions seems to be a general Python limitation, and I have already tried Python 3.5 and 3.6.

@danijar
Contributor

danijar commented Sep 24, 2017

I've seen it working on many people's computers :)

Please check if YAML is installed:

python3 -c "import ruamel.yaml; print('success')"

And check if the Pendulum environment works:

python3 -c "import gym; e=gym.make('Pendulum-v0'); e.reset(); e.render(); input('success')"

If both work, please start from a fresh clone of this repository and report your error message again.

@donamin
Author

donamin commented Sep 24, 2017

Thanks for your reply.

I tried both tests with success.

I cloned the repository again and the code still doesn't work. It no longer shows the lambda error, but it stalls when it reaches this line of code in wrappers.py:
self._process.start()

When I debug, stepping into the start function eventually leads me to this line in context.py (the code hangs when it reaches this line):
from .popen_spawn_win32 import Popen

BTW, I'm using Windows 10. Maybe it has something to do with the OS?

@danijar
Contributor

danijar commented Sep 24, 2017

Yeah, that might be the problem. Multiprocessing is quite different between Windows and Linux/Mac, and we mainly tested on the latter. I'm afraid I can't be of much help since I don't use Windows. Do you have an idea how to debug this? I'd be happy to test and merge a fix if you come up with one.
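A minimal standalone illustration of the difference, independent of this repository:

import multiprocessing

def work(x):
  # Module-level functions can be pickled, so they work as process
  # targets under the 'spawn' start method that Windows uses.
  return x * 2

if __name__ == '__main__':
  # On Windows the child is a fresh interpreter and the target plus its
  # arguments are pickled and sent to it ('spawn'). Linux defaults to
  # 'fork', which copies the parent's memory instead, so lambdas and
  # locally defined functions still work there.
  ctx = multiprocessing.get_context('spawn')
  process = ctx.Process(target=work, args=(21,))
  process.start()
  process.join()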

@donamin
Author

donamin commented Sep 24, 2017

OK, thanks for your reply. I have no idea right now, but I will work on it because it's important for me to make it work on Windows. I'll let you know if it's solved.
Thanks :)

@danijar changed the title from "How to check the agent learning progress?" to "Training threads don't start on Windows" on Oct 6, 2017
@danijar
Contributor

danijar commented Nov 9, 2017

@donamin Were you able to narrow down this issue?

@donamin
Author

donamin commented Nov 9, 2017

@danijar No, I couldn't solve it, so I had to switch to Linux. Sorry.

@danijar
Contributor

danijar commented Nov 9, 2017

Thanks for getting back. I'll keep this issue open for now. We might support Windows in the future since, as far as I can see, the threading is the only platform-specific bit. But unfortunately, there are no concrete plans for this at the moment.

@erwincoumans

It seems you cannot use the _worker class method as the multiprocessing.Process target on Windows.
If you use a global function, e.g. def globalworker(constructor, conn):, it will not hang. But then it cannot use getattr.
Is there a way to rewrite _worker to be a globalworker?

self._process = multiprocessing.Process(
    target=globalworker, args=(constructor, conn))

@danijar
Contributor

danijar commented Dec 18, 2018

@erwincoumans Yes, this seems trivial since self._worker() does not access any object state. You'd just have to replace the occurrences of self with ExternalProcess. I'd be happy to accept a patch if this indeed fixes the behavior on Windows. I don't have a way to test on Windows myself.
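Something along these lines might work (a rough sketch only; the real _worker in wrappers.py dispatches more message types than shown here):

import multiprocessing

def _worker(constructor, conn):
  # Module-level worker, so the 'spawn' start method on Windows can
  # pickle the process target. The constructor must also be picklable
  # (e.g. a functools.partial over a module-level factory); conn is one
  # end of a multiprocessing.Pipe.
  env = constructor()
  while True:
    message, payload = conn.recv()
    if message == 'close':
      conn.close()
      break
    # ... dispatch 'access' and 'call' messages to env here ...

class ExternalProcess(object):

  def __init__(self, constructor):
    self._conn, conn = multiprocessing.Pipe()
    # Passing a module-level function rather than a bound method keeps
    # everything that is pickled for the child process picklable.
    self._process = multiprocessing.Process(
        target=_worker, args=(constructor, conn))
    self._process.start()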
