PPO-pytorch

Implementation of PPO (Proximal Policy Optimization) using PyTorch.
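For context, the update at the heart of PPO is the clipped surrogate objective from the original paper. Below is a minimal PyTorch sketch of that loss; the function name, tensor names, and the clip coefficient are illustrative, not taken from this repository's code.

    import torch

    def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
        """Clipped surrogate loss from the PPO paper (negated, so it can be minimized)."""
        # Probability ratio r_t(theta) = pi_theta(a|s) / pi_theta_old(a|s),
        # computed from stored log-probabilities for numerical stability.
        ratio = torch.exp(log_probs_new - log_probs_old)
        # Unclipped and clipped surrogate terms.
        surr1 = ratio * advantages
        surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
        # PPO maximizes the elementwise minimum; negate to obtain a loss.
        return -torch.min(surr1, surr2).mean()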

Training results

The red line represents the goal of the environment, as specified by OpenAI Gym.
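The value the red line presumably marks is each environment's reward_threshold in gym's registry. You can inspect it like this (a quick check assuming classic gym; some environments define no threshold and report None):

    import gym

    # Print gym's "solved" threshold for each environment used below.
    for env_id in ["Pendulum-v0", "HalfCheetah-v3", "Swimmer-v3",
                   "Hopper-v3", "Walker2d-v3"]:
        print(env_id, gym.spec(env_id).reward_threshold)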

Note that not all of these goals are reached, but the implementation does achieve results similar to Figure 3 of the original paper, and better than the results in Benchmarks for Spinning Up Implementations.

Note that I didn't specify a seed, so you may get a different result. However, in my experience this code achieves similar results across different seeds, so you should be able to get a decent result after trying a few seeds (or even with none specified).
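If you do want reproducibility, a typical seeding setup looks like the sketch below. This is an assumption about how one might wire it in, not a flag this repository necessarily exposes:

    import random
    import numpy as np
    import torch
    import gym

    def seed_everything(seed, env=None):
        # Seed every RNG the training loop touches.
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        if env is not None:
            env.seed(seed)               # classic gym API (removed in gym >= 0.26)
            env.action_space.seed(seed)

    env = gym.make("Pendulum-v0")
    seed_everything(42, env)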

Pendulum-v0

python main.py --env-name "Pendulum-v0" --learning-rate 0.0003 --learn-interval 1000 --batch-size 200 --total-steps 300000 --num-process 3
[plots: Pendulum-v0 reward and running reward; Pendulum-v0 running rewards over multiple runs]
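The plots track both the per-episode reward and a "running reward". This repository's exact definition isn't shown here; a common choice (and the assumption in this sketch) is an exponential moving average over episode returns:

    def update_running_reward(running_reward, episode_reward, alpha=0.05):
        """Exponential moving average of episode returns.

        alpha is illustrative: it controls how quickly the average tracks
        recent episodes. Initialize running_reward with the first episode's
        return.
        """
        return alpha * episode_reward + (1.0 - alpha) * running_reward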

HalfCheetah-v3

python main.py --env-name "HalfCheetah-v3" --total-steps 5000000 --learn-interval 2000 --learning-rate 0.0007 --batch-size 2000
[plots: HalfCheetah-v3 reward and running reward; HalfCheetah-v3 running rewards over multiple runs]

Swimmer-v3

python main.py --env-name "Swimmer-v3" --total-steps 1000000 --learn-interval 2000 --learning-rate 0.0005 --batch-size 1000 --std-decay
[plots: Swimmer-v3 reward and running reward; Swimmer-v3 running rewards over multiple runs]
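The --std-decay flag used in this and the following commands is not documented above. A common scheme, and only an assumption about what the flag does here, is to anneal the Gaussian policy's action standard deviation as training progresses, shrinking exploration as the policy improves:

    def decayed_std(step, total_steps, std_start=1.0, std_end=0.1):
        # Linearly anneal the action std from std_start to std_end over
        # training. All constants here are illustrative, not the repo's.
        frac = min(step / total_steps, 1.0)
        return std_start + frac * (std_end - std_start)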

Hopper-v3

python main.py --env-name "Hopper-v3" --total-steps 5000000 --learn-interval 2000 --learning-rate 0.0005 --batch-size 1000 --std-decay
[plots: Hopper-v3 reward and running reward; Hopper-v3 running rewards over multiple runs]

Walker2d-v3

python main.py --env-name "Walker2d-v3" --total-steps 5000000 --learn-interval 2000 --learning-rate 0.0005 --batch-size 1000 --std-decay
[plots: Walker2d-v3 reward and running reward; Walker2d-v3 running rewards over multiple runs]

Reference

  • Schulman et al., Proximal Policy Optimization Algorithms: https://arxiv.org/abs/1707.06347
  • OpenAI Spinning Up, Benchmarks for Spinning Up Implementations: https://spinningup.openai.com/en/latest/spinningup/bench.html

Todo

  • discrete action (see the sketch below)
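For the discrete-action item, the usual change is to replace the Gaussian policy head with a categorical one. A minimal sketch; layer sizes and names are illustrative:

    import torch.nn as nn
    from torch.distributions import Categorical

    class DiscretePolicy(nn.Module):
        """Policy head for discrete action spaces: logits -> Categorical."""

        def __init__(self, obs_dim, n_actions, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.Tanh(),
                nn.Linear(hidden, n_actions),
            )

        def forward(self, obs):
            # dist.sample() and dist.log_prob(action) plug directly into
            # the PPO probability ratio, just like the Gaussian case.
            return Categorical(logits=self.net(obs))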
