How to get vpg_1.py in examples to work with other environments such as Acrobot and monitor their performance? #89

Open
yxchng opened this issue Feb 26, 2017 · 14 comments

@yxchng commented Feb 26, 2017

What I mean by monitor here is to use gym.wrappers.

@dementrock (Member)

It should be relatively easy if you follow gym's documentation. You can also refer to gym_env.py, which is a wrapper around gym environments.
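
For example, a rough sketch of swapping in another gym environment through that wrapper (the "Acrobot-v1" id and the log directory are placeholders, not tested here):

from rllab.envs.gym_env import GymEnv

# GymEnv wraps a gym environment id; record_video and log_dir control the
# gym monitor attached inside the wrapper (see gym_env.py).
env = GymEnv("Acrobot-v1", record_video=False, log_dir="exps/acrobot/")

# The wrapped spaces expose the same helpers the examples rely on
# (e.g. new_tensor_variable), so the rest of vpg_1.py can stay unchanged.
print(env.observation_space, env.action_space)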

@yxchng (Author) commented Feb 27, 2017

observations_var = env.observation_space.new_tensor_variable(
    'observations',
    # It should have 1 extra dimension since we want to represent a list of observations
    extra_dims=1
)

actions_var = env.action_space.new_tensor_variable(
    'actions',
    extra_dims=1
)

What should I change in the above part when dealing with the following environment?

env = normalize(GymEnv("CartPole-v1"))

Currently I am getting the following error.

TypeError: ('Bad input argument to theano function with name "vpg_1.py:81" at index 1 (0-based)', 'Wrong number of dimensions: expected 2, got 1 with shape (1633,).')

I am using CategoricalMLPPolicy.

@yxchng (Author) commented Feb 27, 2017

The documentation is very incomplete, and the examples do not tell me anything related to what I want to do.

I don't quite understand why it works for GaussianMLPPolicy and the original env, but when those are changed it doesn't work anymore.

@dementrock (Member)

It is because CartPole uses discrete actions. Use a CategoricalMLPPolicy instead.
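
As a quick sanity check (only a sketch), inspecting the wrapped action space shows why the Box assertion in GaussianMLPPolicy fails:

from rllab.spaces import Box, Discrete

# CartPole's action space converts to rllab's Discrete(2), so
# isinstance(env_spec.action_space, Box) in GaussianMLPPolicy fails.
print(isinstance(env.spec.action_space, Discrete))  # expected: True
print(isinstance(env.spec.action_space, Box))       # expected: False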

@yxchng (Author) commented Feb 27, 2017

I mentioned above that I used CategoricalMLPPolicy, but I still face the same problem.

@dementrock (Member)

Can you show your entire snippet of code?

@leduckhc

Hi @yxchng, I could not find the gym environment CartPole-v1 in https://github.com/openai/gym/wiki/Table-of-environments. If such an environment exists, the gym wiki should definitely be updated. For CartPole-v0, the observation space is Box(4,) and the action space is Discrete(2), i.e. move_left, move_right. The normalize wrapper only works for continuous actions; it does not make sense to normalize discrete actions.

To solve your problem (a minimal sketch follows this list):

  1. Remove normalize environment
  2. Use Categorical policy, such as CategoricalMLPPolicy
  3. If that does not solve your problem, post your full code snippet so we can investigate the error.
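
A minimal sketch of steps 1 and 2 (the log directory and hidden layer size are arbitrary):

from rllab.envs.gym_env import GymEnv
from rllab.policies.categorical_mlp_policy import CategoricalMLPPolicy

# No normalize() wrapper: the actions are already Discrete(2), so there is
# nothing to rescale.
env = GymEnv("CartPole-v0", record_video=False, log_dir="exps/cartpole/")
policy = CategoricalMLPPolicy(env.spec, hidden_sizes=(8,))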

@yxchng (Author) commented Feb 28, 2017

from rllab.envs.box2d.mountain_car_env import MountainCarEnv

from rllab.envs.box2d.cartpole_env import CartpoleEnv
from rllab.policies.gaussian_mlp_policy import GaussianMLPPolicy
from rllab.envs.normalized_env import normalize
from rllab.envs.gym_env import GymEnv
import rllab.misc.logger as logger


import numpy as np
import theano
import theano.tensor as TT
from lasagne.updates import adam
from gym import wrappers
import gym
import logging

# normalize() makes sure that the actions for the environment lies
# within the range [-1, 1] (only works for environments with continuous actions)
# env = normalize(CartpoleEnv())

# env = normalize(GymEnv("CartPole-v1", record_video=False, log_dir="exps/CartPole-v1/"))
env = GymEnv("CartPole-v0", record_video=False, log_dir="exps/CartPole/")

# Initialize a neural network policy with a single hidden layer of 4 hidden units
policy = GaussianMLPPolicy(env.spec, hidden_sizes=(4,))

# We will collect N = 10 trajectories per iteration
N = 10
# Each trajectory will have at most 100 time steps
T = 100
# Number of iterations
n_itr = 100
# Set the discount factor for the problem
discount = 0.99
# Learning rate for the gradient update
learning_rate = 0.01

# Construct the computation graph

# Create a Theano variable for storing the observations
# We could have simply written `observations_var = TT.matrix('observations')` instead for this example. However,
# doing it in a slightly more abstract way allows us to delegate to the environment for handling the correct data
# type for the variable. For instance, for an environment with discrete observations, we might want to use integer
# types if the observations are represented as one-hot vectors.
observations_var = env.observation_space.new_tensor_variable(
    'observations',
    # It should have 1 extra dimension since we want to represent a list of observations
    extra_dims=1
)
actions_var = env.action_space.new_tensor_variable(
    'actions',
    extra_dims=1
)
returns_var = TT.vector('returns')

# policy.dist_info_sym returns a dictionary, whose values are symbolic expressions for quantities related to the
# distribution of the actions. For a Gaussian policy, it contains the mean and the logarithm of the standard deviation.
dist_info_vars = policy.dist_info_sym(observations_var)

# policy.distribution returns a distribution object under rllab.distributions. It contains many utilities for computing
# distribution-related quantities, given the computed dist_info_vars. Below we use dist.log_likelihood_sym to compute
# the symbolic log-likelihood. For this example, the corresponding distribution is an instance of the class
# rllab.distributions.DiagonalGaussian
dist = policy.distribution

# Note that we negate the objective, since most optimizers assume a minimization problem
surr = - TT.mean(dist.log_likelihood_sym(actions_var, dist_info_vars) * returns_var)

# Get the list of trainable parameters.
params = policy.get_params(trainable=True)
grads = theano.grad(surr, params)

f_train = theano.function(
    inputs=[observations_var, actions_var, returns_var],
    outputs=None,
    updates=adam(grads, params, learning_rate=learning_rate),
    allow_input_downcast=True
)

for _ in range(n_itr):

    paths = []

    for _ in range(N):
        observations = []
        actions = []
        rewards = []

        observation = env.reset()

        for _ in range(T):
            # policy.get_action() returns a pair of values. The second one returns a dictionary, whose values contains
            # sufficient statistics for the action distribution. It should at least contain entries that would be
            # returned by calling policy.dist_info(), which is the non-symbolic analog of policy.dist_info_sym().
            # Storing these statistics is useful, e.g., when forming importance sampling ratios. In our case it is
            # not needed.
            action, _ = policy.get_action(observation)
            # Recall that the last entry of the tuple stores diagnostic information about the environment. In our
            # case it is not needed.
            next_observation, reward, terminal, _ = env.step(action)
            observations.append(observation)
            actions.append(action)
            rewards.append(reward)
            observation = next_observation
            if terminal:
                # Finish rollout if terminal state reached
                break

        # We need to compute the empirical return for each time step along the
        # trajectory
        returns = []
        return_so_far = 0
        for t in range(len(rewards) - 1, -1, -1):
            return_so_far = rewards[t] + discount * return_so_far
            returns.append(return_so_far)
        # The returns are stored backwards in time, so we need to revert it
        returns = returns[::-1]

        paths.append(dict(
            observations=np.array(observations),
            actions=np.array(actions),
            rewards=np.array(rewards),
            returns=np.array(returns)
        ))

    observations = np.concatenate([p["observations"] for p in paths])
    actions = np.concatenate([p["actions"] for p in paths])
    returns = np.concatenate([p["returns"] for p in paths])

    f_train(observations, actions, returns)
    print('Average Return:', np.mean([sum(p["rewards"]) for p in paths]))

env.close()
gym.upload('/exps/CartPole/', api_key='sk_6jB59OwBQICeXwMlUsqUBw')

This is my full code. The error still persists.

/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/envs/rllab3/lib/python3.5/site-packages/theano/tensor/signal/downsample.py:6: UserWarning: downsample module has been moved to the theano.tensor.signal.pool module.
  "downsample module has been moved to the theano.tensor.signal.pool module.")
Traceback (most recent call last):
  File "/Users/yxchng/Desktop/rllab-master/rllab/envs/gym_env.py", line 10, in <module>
    monitor_logger.setLevel(logging.WARNING)
NameError: name 'logging' is not defined
[2017-02-28 13:32:59,722] Making new env: CartPole-v0
[2017-02-28 13:32:59,729] Clearing 2 monitor files from previous run (because force=True was provided)
Traceback (most recent call last):
  File "vpg_1_acrobot.py", line 26, in <module>
    policy = GaussianMLPPolicy(env.spec, hidden_sizes=(4,))
  File "/Users/yxchng/Desktop/rllab-master/rllab/policies/gaussian_mlp_policy.py", line 55, in __init__
    assert isinstance(env_spec.action_space, Box)
AssertionError

@dementrock (Member)

Hi, it seems like you are using GaussianMLPPolicy, but you should be using CategoricalMLPPolicy.

Another thing with discrete actions: methods like dist.log_likelihood_sym expect the received actions to be in one-hot representation. Where you write actions.append(action), replace it with actions.append(env.action_space.flatten(action)), and it should work.
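
A tiny sketch of what that flatten call produces for a Discrete(2) action space:

# env.action_space.flatten turns an integer action into the one-hot vector
# that dist.log_likelihood_sym expects for a discrete distribution.
action, _ = policy.get_action(observation)        # e.g. action == 1
actions.append(env.action_space.flatten(action))  # appends array([0., 1.])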

@yxchng (Author) commented Feb 28, 2017

I edited the code according to your advice.

from rllab.policies.categorical_mlp_policy import CategoricalMLPPolicy
from rllab.envs.normalized_env import normalize
from rllab.envs.gym_env import GymEnv
import rllab.misc.logger as logger


import numpy as np
import theano
import theano.tensor as TT
from lasagne.updates import adam
from gym import wrappers
import gym
import logging

# normalize() makes sure that the actions for the environment lies
# within the range [-1, 1] (only works for environments with continuous actions)
# env = normalize(CartpoleEnv())

env = GymEnv("CartPole-v0", record_video=False, log_dir="exps/cartpole/")

# Initialize a neural network policy with a single hidden layer of 4 hidden units
policy = CategoricalMLPPolicy(env.spec, hidden_sizes=(4,))

# We will collect N = 10 trajectories per iteration
N = 10
# Each trajectory will have at most 100 time steps
T = 100
# Number of iterations
n_itr = 100
# Set the discount factor for the problem
discount = 0.99
# Learning rate for the gradient update
learning_rate = 0.01

# Construct the computation graph

# Create a Theano variable for storing the observations
# We could have simply written `observations_var = TT.matrix('observations')` instead for this example. However,
# doing it in a slightly more abstract way allows us to delegate to the environment for handling the correct data
# type for the variable. For instance, for an environment with discrete observations, we might want to use integer
# types if the observations are represented as one-hot vectors.
observations_var = env.observation_space.new_tensor_variable(
    'observations',
    # It should have 1 extra dimension since we want to represent a list of observations
    extra_dims=1
)
actions_var = env.action_space.new_tensor_variable(
    'actions',
    extra_dims=1
)
returns_var = TT.vector('returns')

# policy.dist_info_sym returns a dictionary, whose values are symbolic expressions for quantities related to the
# distribution of the actions. For a categorical policy, it contains the action probabilities.
dist_info_vars = policy.dist_info_sym(observations_var)

# policy.distribution returns a distribution object under rllab.distributions. It contains many utilities for computing
# distribution-related quantities, given the computed dist_info_vars. Below we use dist.log_likelihood_sym to compute
# the symbolic log-likelihood. For this example, the corresponding distribution is an instance of the class
# rllab.distributions.Categorical
dist = policy.distribution

# Note that we negate the objective, since most optimizers assume a minimization problem
surr = - TT.mean(dist.log_likelihood_sym(actions_var, dist_info_vars) * returns_var)

# Get the list of trainable parameters.
params = policy.get_params(trainable=True)
grads = theano.grad(surr, params)

f_train = theano.function(
    inputs=[observations_var, actions_var, returns_var],
    outputs=None,
    updates=adam(grads, params, learning_rate=learning_rate),
    allow_input_downcast=True
)

for _ in range(n_itr):

    paths = []

    for _ in range(N):
        observations = []
        actions = []
        rewards = []

        observation = env.reset()

        for _ in range(T):
            # policy.get_action() returns a pair of values. The second one returns a dictionary, whose values contains
            # sufficient statistics for the action distribution. It should at least contain entries that would be
            # returned by calling policy.dist_info(), which is the non-symbolic analog of policy.dist_info_sym().
            # Storing these statistics is useful, e.g., when forming importance sampling ratios. In our case it is
            # not needed.
            action, _ = policy.get_action(observation)
            # Recall that the last entry of the tuple stores diagnostic information about the environment. In our
            # case it is not needed.
            next_observation, reward, terminal, _ = env.step(action)
            observations.append(observation)
            actions.append(env.action_space.flatten(action))
            rewards.append(reward)
            observation = next_observation
            if terminal:
                # Finish rollout if terminal state reached
                break

        # We need to compute the empirical return for each time step along the
        # trajectory
        returns = []
        return_so_far = 0
        for t in range(len(rewards) - 1, -1, -1):
            return_so_far = rewards[t] + discount * return_so_far
            returns.append(return_so_far)
        # The returns are stored backwards in time, so we need to revert it
        returns = returns[::-1]

        paths.append(dict(
            observations=np.array(observations),
            actions=np.array(actions),
            rewards=np.array(rewards),
            returns=np.array(returns)
        ))

    observations = np.concatenate([p["observations"] for p in paths])
    actions = np.concatenate([p["actions"] for p in paths])
    returns = np.concatenate([p["returns"] for p in paths])

    f_train(observations, actions, returns)
    print('Average Return:', np.mean([sum(p["rewards"]) for p in paths]))

env.close()
gym.upload('/exps/cartpole/', api_key='sk_6jB59OwBQICeXwMlUsqUBw')

I followed your advice, but now it gives me:

/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/envs/rllab3/lib/python3.5/site-packages/theano/tensor/signal/downsample.py:6: UserWarning: downsample module has been moved to the theano.tensor.signal.pool module.
  "downsample module has been moved to the theano.tensor.signal.pool module.")
Traceback (most recent call last):
  File "/Users/yxchng/Desktop/rllab-master/rllab/envs/gym_env.py", line 10, in <module>
    monitor_logger.setLevel(logging.WARNING)
NameError: name 'logging' is not defined
[2017-02-28 17:46:55,728] Making new env: CartPole-v0
[2017-02-28 17:46:55,735] Creating monitor directory exps/cartpole/
Traceback (most recent call last):
  File "vpg_1_acrobot.py", line 87, in <module>
    observation = env.reset()
  File "/Users/yxchng/Desktop/rllab-master/rllab/envs/gym_env.py", line 105, in reset
    recorder = self.env._monitor.stats_recorder
AttributeError: '_Monitor' object has no attribute '_monitor'
[2017-02-28 17:46:56,563] Finished writing results. You can upload them to the scoreboard via gym.upload('/Users/yxchng/Desktop/rllab-master/examples/exps/cartpole')
(rllab3)Chngs-MacBook-Pro-3:examples yxchng$ python vpg_1_acrobot.py 
/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/envs/rllab3/lib/python3.5/site-packages/theano/tensor/signal/downsample.py:6: UserWarning: downsample module has been moved to the theano.tensor.signal.pool module.
  "downsample module has been moved to the theano.tensor.signal.pool module.")
Traceback (most recent call last):
  File "/Users/yxchng/Desktop/rllab-master/rllab/envs/gym_env.py", line 10, in <module>
    monitor_logger.setLevel(logging.WARNING)
NameError: name 'logging' is not defined
[2017-02-28 17:47:38,920] Making new env: CartPole-v0
[2017-02-28 17:47:38,927] Clearing 2 monitor files from previous run (because force=True was provided)
Average Return: 34.7
Average Return: 35.4
Average Return: 36.6
Traceback (most recent call last):
  File "vpg_1_acrobot.py", line 87, in <module>
    observation = env.reset()
  File "/Users/yxchng/Desktop/rllab-master/rllab/envs/gym_env.py", line 108, in reset
    return self.env.reset()
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/envs/rllab3/lib/python3.5/site-packages/gym/core.py", line 123, in reset
    observation = self._reset()
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/envs/rllab3/lib/python3.5/site-packages/gym/wrappers/monitoring.py", line 39, in _reset
    self._before_reset()
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/envs/rllab3/lib/python3.5/site-packages/gym/wrappers/monitoring.py", line 190, in _before_reset
    self.stats_recorder.before_reset()
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/envs/rllab3/lib/python3.5/site-packages/gym/monitoring/stats_recorder.py", line 68, in before_reset
    raise error.Error("Tried to reset environment which is not done. While the monitor is active for {}, you cannot call reset() unless the episode is over.".format(self.env_id))
gym.error.Error: Tried to reset environment which is not done. While the monitor is active for CartPole-v0, you cannot call reset() unless the episode is over.

@yxchng (Author) commented Feb 28, 2017

I tried using force_reset=True but it doesn't work.

@dementrock (Member)

Hi,

Make sure you are using this commit of gym exactly: 93d554bdbb4b2d29ff1a685158dbde93b36e3801

Refer to https://github.com/openai/rllab/blob/master/environment.yml. Make sure you are using the latest rllab code.

@gy2256 commented Apr 1, 2017

I also have the same error:

raise error.Error("Tried to reset environment which is not done. While the monitor is active for {}, you cannot call reset() unless the episode is over.".format(self.env_id))
gym.error.Error: Tried to reset environment which is not done. While the monitor is active for CartPole-v0, you cannot call reset() unless the episode is over.

@dementrock (Member)

@gyang1011 set force_reset to True, as in the sketch below.
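
For reference, a sketch of the constructor call with that flag (paths are placeholders):

from rllab.envs.gym_env import GymEnv

# force_reset=True lets the wrapper flush the monitor's episode state so that
# reset() can be called before the episode is done. It relies on the gym
# commit pinned in rllab's environment.yml; with a mismatched gym version you
# get the '_Monitor' AttributeError shown earlier in this thread.
env = GymEnv("CartPole-v0", record_video=False,
             log_dir="exps/cartpole/", force_reset=True)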
