How to get vpg_1.py in examples to work with other environments such as Acrobot and monitor their performance? #89

Open
yxchng opened this issue Feb 26, 2017 · 14 comments

@yxchng commented Feb 26, 2017

What I mean by monitor here is to use gym.wrappers.

@dementrock (Member)

It should be relatively easy if you follow gym's documentation. You can also refer to gym_env.py, which is a wrapper around gym environments.
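
For example, a rough sketch of swapping in another gym environment through that wrapper (the "Acrobot-v1" id and the log directory are placeholders, not tested here):

from rllab.envs.gym_env import GymEnv

# GymEnv wraps a gym environment id; record_video and log_dir control the
# gym monitor attached inside the wrapper (see gym_env.py).
env = GymEnv("Acrobot-v1", record_video=False, log_dir="exps/acrobot/")

# The wrapped spaces expose the same helpers the examples rely on
# (e.g. new_tensor_variable), so the rest of vpg_1.py can stay unchanged.
print(env.observation_space, env.action_space)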

@yxchng (Author) commented Feb 27, 2017

observations_var = env.observation_space.new_tensor_variable(
    'observations',
    # It should have 1 extra dimension since we want to represent a list of observations
    extra_dims=1
)

actions_var = env.action_space.new_tensor_variable(
    'actions',
    extra_dims=1
)

What should I change in the above part when dealing with the following environment?

env = normalize(GymEnv("CartPole-v1"))

Currently I am getting the following error.

TypeError: ('Bad input argument to theano function with name "vpg_1.py:81" at index 1 (0-based)', 'Wrong number of dimensions: expected 2, got 1 with shape (1633,).')

I am using CategoricalMLPPolicy.

@yxchng (Author) commented Feb 27, 2017

The documentation is very incomplete, and the examples do not tell me anything related to what I want to do.

I don't quite understand why it works for GaussianMLPPolicy and the original env, but when those are changed it doesn't work anymore.

@dementrock (Member)

It is because CartPole uses discrete actions. Use a CategoricalMLPPolicy instead.
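
As a quick sanity check (only a sketch), inspecting the wrapped action space shows why the Box assertion in GaussianMLPPolicy fails:

from rllab.spaces import Box, Discrete

# CartPole's action space converts to rllab's Discrete(2), so
# isinstance(env_spec.action_space, Box) in GaussianMLPPolicy fails.
print(isinstance(env.spec.action_space, Discrete))  # expected: True
print(isinstance(env.spec.action_space, Box))       # expected: False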

@yxchng (Author) commented Feb 27, 2017

I mentioned above that I used CategoricalMLPPolicy, but I still face the same problem.

@dementrock (Member)

Can you show your entire snippet of code?

@leduckhc

Hi @yxchng, I could not find the gym environment CartPole-v1 in https://github.com/openai/gym/wiki/Table-of-environments. If such an environment exists, the gym wiki should definitely be updated. For CartPole-v0, the observation space is Box(4,) and the action space is Discrete(2), i.e. move_left, move_right. The normalize wrapper only works for continuous actions; it does not make sense to normalize discrete actions.

To solve your problem (a minimal sketch follows this list):

  1. Remove normalize environment
  2. Use Categorical policy, such as CategoricalMLPPolicy
  3. If that does not solve your problem, post your full code snippet so we can investigate the error.
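
A minimal sketch of steps 1 and 2 (the log directory and hidden layer size are arbitrary):

from rllab.envs.gym_env import GymEnv
from rllab.policies.categorical_mlp_policy import CategoricalMLPPolicy

# No normalize() wrapper: the actions are already Discrete(2), so there is
# nothing to rescale.
env = GymEnv("CartPole-v0", record_video=False, log_dir="exps/cartpole/")
policy = CategoricalMLPPolicy(env.spec, hidden_sizes=(8,))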

@yxchng (Author) commented Feb 28, 2017

from rllab.envs.box2d.mountain_car_env import MountainCarEnv

from rllab.envs.box2d.cartpole_env import CartpoleEnv
from rllab.policies.gaussian_mlp_policy import GaussianMLPPolicy
from rllab.envs.normalized_env import normalize
from rllab.envs.gym_env import GymEnv
import rllab.misc.logger as logger


import numpy as np
import theano
import theano.tensor as TT
from lasagne.updates import adam
from gym import wrappers
import gym
import logging

# normalize() makes sure that the actions for the environment lies
# within the range [-1, 1] (only works for environments with continuous actions)
# env = normalize(CartpoleEnv())

# env = normalize(GymEnv("CartPole-v1", record_video=False, log_dir="exps/CartPole-v1/"))
env = GymEnv("CartPole-v0", record_video=False, log_dir="exps/CartPole/")

# Initialize a neural network policy with a single hidden layer of 4 hidden units
policy = GaussianMLPPolicy(env.spec, hidden_sizes=(4,))

# We will collect N = 10 trajectories per iteration
N = 10
# Each trajectory will have at most 100 time steps
T = 100
# Number of iterations
n_itr = 100
# Set the discount factor for the problem
discount = 0.99
# Learning rate for the gradient update
learning_rate = 0.01

# Construct the computation graph

# Create a Theano variable for storing the observations
# We could have simply written `observations_var = TT.matrix('observations')` instead for this example. However,
# doing it in a slightly more abstract way allows us to delegate to the environment for handling the correct data
# type for the variable. For instance, for an environment with discrete observations, we might want to use integer
# types if the observations are represented as one-hot vectors.
observations_var = env.observation_space.new_tensor_variable(
    'observations',
    # It should have 1 extra dimension since we want to represent a list of observations
    extra_dims=1
)
actions_var = env.action_space.new_tensor_variable(
    'actions',
    extra_dims=1
)
returns_var = TT.vector('returns')

# policy.dist_info_sym returns a dictionary, whose values are symbolic expressions for quantities related to the
# distribution of the actions. For a Gaussian policy, it contains the mean and the logarithm of the standard deviation.
dist_info_vars = policy.dist_info_sym(observations_var)

# policy.distribution returns a distribution object under rllab.distributions. It contains many utilities for computing
# distribution-related quantities, given the computed dist_info_vars. Below we use dist.log_likelihood_sym to compute
# the symbolic log-likelihood. For this example, the corresponding distribution is an instance of the class
# rllab.distributions.DiagonalGaussian
dist = policy.distribution

# Note that we negate the objective, since most optimizers assume a minimization problem
surr = - TT.mean(dist.log_likelihood_sym(actions_var, dist_info_vars) * returns_var)

# Get the list of trainable parameters.
params = policy.get_params(trainable=True)
grads = theano.grad(surr, params)

f_train = theano.function(
    inputs=[observations_var, actions_var, returns_var],
    outputs=None,
    updates=adam(grads, params, learning_rate=learning_rate),
    allow_input_downcast=True
)

for _ in range(n_itr):

    paths = []

    for _ in range(N):
        observations = []
        actions = []
        rewards = []

        observation = env.reset()

        for _ in range(T):
            # policy.get_action() returns a pair of values. The second one returns a dictionary, whose values contains
            # sufficient statistics for the action distribution. It should at least contain entries that would be
            # returned by calling policy.dist_info(), which is the non-symbolic analog of policy.dist_info_sym().
            # Storing these statistics is useful, e.g., when forming importance sampling ratios. In our case it is
            # not needed.
            action, _ = policy.get_action(observation)
            # Recall that the last entry of the tuple stores diagnostic information about the environment. In our
            # case it is not needed.
            next_observation, reward, terminal, _ = env.step(action)
            observations.append(observation)
            actions.append(action)
            rewards.append(reward)
            observation = next_observation
            if terminal:
                # Finish rollout if terminal state reached
                break

        # We need to compute the empirical return for each time step along the
        # trajectory
        returns = []
        return_so_far = 0
        for t in range(len(rewards) - 1, -1, -1):
            return_so_far = rewards[t] + discount * return_so_far
            returns.append(return_so_far)
        # The returns are stored backwards in time, so we need to revert it
        returns = returns[::-1]

        paths.append(dict(
            observations=np.array(observations),
            actions=np.array(actions),
            rewards=np.array(rewards),
            returns=np.array(returns)
        ))

    observations = np.concatenate([p["observations"] for p in paths])
    actions = np.concatenate([p["actions"] for p in paths])
    returns = np.concatenate([p["returns"] for p in paths])

    f_train(observations, actions, returns)
    print('Average Return:', np.mean([sum(p["rewards"]) for p in paths]))

env.close()
gym.upload('/exps/CartPole/', api_key='sk_6jB59OwBQICeXwMlUsqUBw')

This is my full code. The error still persists.

/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/envs/rllab3/lib/python3.5/site-packages/theano/tensor/signal/downsample.py:6: UserWarning: downsample module has been moved to the theano.tensor.signal.pool module.
  "downsample module has been moved to the theano.tensor.signal.pool module.")
Traceback (most recent call last):
  File "/Users/yxchng/Desktop/rllab-master/rllab/envs/gym_env.py", line 10, in <module>
    monitor_logger.setLevel(logging.WARNING)
NameError: name 'logging' is not defined
[2017-02-28 13:32:59,722] Making new env: CartPole-v0
[2017-02-28 13:32:59,729] Clearing 2 monitor files from previous run (because force=True was provided)
Traceback (most recent call last):
  File "vpg_1_acrobot.py", line 26, in <module>
    policy = GaussianMLPPolicy(env.spec, hidden_sizes=(4,))
  File "/Users/yxchng/Desktop/rllab-master/rllab/policies/gaussian_mlp_policy.py", line 55, in __init__
    assert isinstance(env_spec.action_space, Box)
AssertionError

@dementrock (Member)

Hi, it seems like you are using GaussianMLPPolicy, but you should be using CategoricalMLPPolicy.

Another thing with discrete actions: methods like dist.log_likelihood_sym expect the received actions to be in one-hot representation. Where you write actions.append(action), replace it with actions.append(env.action_space.flatten(action)), and it should work.
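
A tiny sketch of what that flatten call produces for a Discrete(2) action space:

# env.action_space.flatten turns an integer action into the one-hot vector
# that dist.log_likelihood_sym expects for a discrete distribution.
action, _ = policy.get_action(observation)        # e.g. action == 1
actions.append(env.action_space.flatten(action))  # appends array([0., 1.])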

@yxchng (Author) commented Feb 28, 2017

I edited the code according to your advice.

from rllab.policies.categorical_mlp_policy import CategoricalMLPPolicy
from rllab.envs.normalized_env import normalize
from rllab.envs.gym_env import GymEnv
import rllab.misc.logger as logger


import numpy as np
import theano
import theano.tensor as TT
from lasagne.updates import adam
from gym import wrappers
import gym
import logging

# normalize() makes sure that the actions for the environment lies
# within the range [-1, 1] (only works for environments with continuous actions)
# env = normalize(CartpoleEnv())

env = GymEnv("CartPole-v0", record_video=False, log_dir="exps/cartpole/")

# Initialize a neural network policy with a single hidden layer of 4 hidden units
policy = CategoricalMLPPolicy(env.spec, hidden_sizes=(4,))

# We will collect N = 10 trajectories per iteration
N = 10
# Each trajectory will have at most 100 time steps
T = 100
# Number of iterations
n_itr = 100
# Set the discount factor for the problem
discount = 0.99
# Learning rate for the gradient update
learning_rate = 0.01

# Construct the computation graph

# Create a Theano variable for storing the observations
# We could have simply written `observations_var = TT.matrix('observations')` instead for this example. However,
# doing it in a slightly more abstract way allows us to delegate to the environment for handling the correct data
# type for the variable. For instance, for an environment with discrete observations, we might want to use integer
# types if the observations are represented as one-hot vectors.
observations_var = env.observation_space.new_tensor_variable(
    'observations',
    # It should have 1 extra dimension since we want to represent a list of observations
    extra_dims=1
)
actions_var = env.action_space.new_tensor_variable(
    'actions',
    extra_dims=1
)
returns_var = TT.vector('returns')

# policy.dist_info_sym returns a dictionary, whose values are symbolic expressions for quantities related to the
# distribution of the actions. For a categorical policy, it contains the action probabilities.
dist_info_vars = policy.dist_info_sym(observations_var)

# policy.distribution returns a distribution object under rllab.distributions. It contains many utilities for computing
# distribution-related quantities, given the computed dist_info_vars. Below we use dist.log_likelihood_sym to compute
# the symbolic log-likelihood. For this example, the corresponding distribution is an instance of the class
# rllab.distributions.Categorical
dist = policy.distribution

# Note that we negate the objective, since most optimizers assume a minimization problem
surr = - TT.mean(dist.log_likelihood_sym(actions_var, dist_info_vars) * returns_var)

# Get the list of trainable parameters.
params = policy.get_params(trainable=True)
grads = theano.grad(surr, params)

f_train = theano.function(
    inputs=[observations_var, actions_var, returns_var],
    outputs=None,
    updates=adam(grads, params, learning_rate=learning_rate),
    allow_input_downcast=True
)

for _ in range(n_itr):

    paths = []

    for _ in range(N):
        observations = []
        actions = []
        rewards = []

        observation = env.reset()

        for _ in range(T):
            # policy.get_action() returns a pair of values. The second one returns a dictionary, whose values contains
            # sufficient statistics for the action distribution. It should at least contain entries that would be
            # returned by calling policy.dist_info(), which is the non-symbolic analog of policy.dist_info_sym().
            # Storing these statistics is useful, e.g., when forming importance sampling ratios. In our case it is
            # not needed.
            action, _ = policy.get_action(observation)
            # Recall that the last entry of the tuple stores diagnostic information about the environment. In our
            # case it is not needed.
            next_observation, reward, terminal, _ = env.step(action)
            observations.append(observation)
            actions.append(env.action_space.flatten(action))
            rewards.append(reward)
            observation = next_observation
            if terminal:
                # Finish rollout if terminal state reached
                break

        # We need to compute the empirical return for each time step along the
        # trajectory
        returns = []
        return_so_far = 0
        for t in range(len(rewards) - 1, -1, -1):
            return_so_far = rewards[t] + discount * return_so_far
            returns.append(return_so_far)
        # The returns are stored backwards in time, so we need to revert it
        returns = returns[::-1]

        paths.append(dict(
            observations=np.array(observations),
            actions=np.array(actions),
            rewards=np.array(rewards),
            returns=np.array(returns)
        ))

    observations = np.concatenate([p["observations"] for p in paths])
    actions = np.concatenate([p["actions"] for p in paths])
    returns = np.concatenate([p["returns"] for p in paths])

    f_train(observations, actions, returns)
    print('Average Return:', np.mean([sum(p["rewards"]) for p in paths]))

env.close()
gym.upload('/exps/cartpole/', api_key='sk_6jB59OwBQICeXwMlUsqUBw')

I followed your advice, but now it gives me:

/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/envs/rllab3/lib/python3.5/site-packages/theano/tensor/signal/downsample.py:6: UserWarning: downsample module has been moved to the theano.tensor.signal.pool module.
  "downsample module has been moved to the theano.tensor.signal.pool module.")
Traceback (most recent call last):
  File "/Users/yxchng/Desktop/rllab-master/rllab/envs/gym_env.py", line 10, in <module>
    monitor_logger.setLevel(logging.WARNING)
NameError: name 'logging' is not defined
[2017-02-28 17:46:55,728] Making new env: CartPole-v0
[2017-02-28 17:46:55,735] Creating monitor directory exps/cartpole/
Traceback (most recent call last):
  File "vpg_1_acrobot.py", line 87, in <module>
    observation = env.reset()
  File "/Users/yxchng/Desktop/rllab-master/rllab/envs/gym_env.py", line 105, in reset
    recorder = self.env._monitor.stats_recorder
AttributeError: '_Monitor' object has no attribute '_monitor'
[2017-02-28 17:46:56,563] Finished writing results. You can upload them to the scoreboard via gym.upload('/Users/yxchng/Desktop/rllab-master/examples/exps/cartpole')
(rllab3)Chngs-MacBook-Pro-3:examples yxchng$ python vpg_1_acrobot.py 
/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/envs/rllab3/lib/python3.5/site-packages/theano/tensor/signal/downsample.py:6: UserWarning: downsample module has been moved to the theano.tensor.signal.pool module.
  "downsample module has been moved to the theano.tensor.signal.pool module.")
Traceback (most recent call last):
  File "/Users/yxchng/Desktop/rllab-master/rllab/envs/gym_env.py", line 10, in <module>
    monitor_logger.setLevel(logging.WARNING)
NameError: name 'logging' is not defined
[2017-02-28 17:47:38,920] Making new env: CartPole-v0
[2017-02-28 17:47:38,927] Clearing 2 monitor files from previous run (because force=True was provided)
Average Return: 34.7
Average Return: 35.4
Average Return: 36.6
Traceback (most recent call last):
  File "vpg_1_acrobot.py", line 87, in <module>
    observation = env.reset()
  File "/Users/yxchng/Desktop/rllab-master/rllab/envs/gym_env.py", line 108, in reset
    return self.env.reset()
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/envs/rllab3/lib/python3.5/site-packages/gym/core.py", line 123, in reset
    observation = self._reset()
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/envs/rllab3/lib/python3.5/site-packages/gym/wrappers/monitoring.py", line 39, in _reset
    self._before_reset()
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/envs/rllab3/lib/python3.5/site-packages/gym/wrappers/monitoring.py", line 190, in _before_reset
    self.stats_recorder.before_reset()
  File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/envs/rllab3/lib/python3.5/site-packages/gym/monitoring/stats_recorder.py", line 68, in before_reset
    raise error.Error("Tried to reset environment which is not done. While the monitor is active for {}, you cannot call reset() unless the episode is over.".format(self.env_id))
gym.error.Error: Tried to reset environment which is not done. While the monitor is active for CartPole-v0, you cannot call reset() unless the episode is over.

@yxchng (Author) commented Feb 28, 2017

I tried using force_reset=True but it doesn't work.

@dementrock (Member)

Hi,

Make sure you are using this commit of gym exactly: 93d554bdbb4b2d29ff1a685158dbde93b36e3801

Refer to https://github.com/openai/rllab/blob/master/environment.yml. Make sure you are using the latest rllab code.

@gy2256 commented Apr 1, 2017

I also have the same error:

raise error.Error("Tried to reset environment which is not done. While the monitor is active for {}, you cannot call reset() unless the episode is over.".format(self.env_id))
gym.error.Error: Tried to reset environment which is not done. While the monitor is active for CartPole-v0, you cannot call reset() unless the episode is over.

@dementrock (Member)

@gyang1011 set force_reset to True, as in the sketch below.
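
For reference, a sketch of the constructor call with that flag (paths are placeholders):

from rllab.envs.gym_env import GymEnv

# force_reset=True lets the wrapper flush the monitor's episode state so that
# reset() can be called before the episode is done. It relies on the gym
# commit pinned in rllab's environment.yml; with a mismatched gym version you
# get the '_Monitor' AttributeError shown earlier in this thread.
env = GymEnv("CartPole-v0", record_video=False,
             log_dir="exps/cartpole/", force_reset=True)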
