AMPED

Advantaged Markovian Proxy Evolution Dynamics (AMPED)

AMPED is an iterative improvement on the MuZero algorithm. There are two important changes relative to MuZero: (1) AMPED combines the MuZero objective and the PPO objective; (2) AMPED uses an n-th order Markov evolution dynamics (NOMAD) function in place of the first-order Markov dynamics function used in MuZero. Specifically, the AMPED objective is formulated as follows: $L(\theta) = -L_{CLIP} + L_{MU} - L_{ENTROPY}$.
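
For concreteness, here is a minimal sketch of how such a combined objective might be assembled, assuming PyTorch; the function names, the clipping constant, and the mean reduction are illustrative assumptions, not the repository's code:

```python
import torch

def ppo_clip_objective(ratio, advantage, eps=0.2):
    # Standard PPO clipped surrogate: E[min(r * A, clip(r, 1 - eps, 1 + eps) * A)]
    return torch.min(ratio * advantage,
                     torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage).mean()

def amped_loss(l_clip, l_mu, l_entropy):
    # L(theta) = -L_CLIP + L_MU - L_ENTROPY, minimized with a standard
    # optimizer such as Adam. l_mu is the MuZero loss over reward, value,
    # and policy predictions; l_entropy is an entropy bonus. L_CLIP and
    # the entropy term are maximized, hence the negations.
    return -l_clip + l_mu - l_entropy
```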

The AMPED objective is minimized using standard gradient descent techniques, e.g., Adam. AMPED extends the first-order Markov dynamics function used in MuZero by allowing $g$ (the dynamics function) to take in the n previous states, $s_{i-n}, \dots, s_i$. The $n$ states are initialized to $s_0$, as generated by the representation function $h$. Formally, we have: $g_\theta(s_{k-n-1}, \dots, s_{k-1}) = (r_k, s_k)$. This allows AMPED to break the standard Markov assumption (that the next state depends only on the current state). We hypothesize that breaking this assumption by introducing the n-th order Markov chain input will allow the dynamics function to generalize better over time.
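
A minimal sketch of such an n-th order dynamics function, assuming PyTorch and a simple MLP over the concatenated state window; the architecture, dimensions, and window handling are illustrative assumptions rather than the repository's implementation:

```python
from collections import deque
import torch
import torch.nn as nn

class NthOrderDynamics(nn.Module):
    """Dynamics function g conditioned on a window of previous latent states."""

    def __init__(self, state_dim: int, n: int, hidden: int = 256):
        super().__init__()
        self.n = n
        # Input is the concatenation of the window s_{k-n-1}, ..., s_{k-1}.
        self.net = nn.Sequential(
            nn.Linear(state_dim * (n + 1), hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim + 1),  # next latent state + scalar reward
        )

    def forward(self, window: torch.Tensor):
        # window: (batch, n + 1, state_dim), oldest state first
        out = self.net(window.flatten(start_dim=1))
        next_state, reward = out[:, :-1], out[:, -1]
        return reward, next_state

# Usage sketch: seed the window with copies of s0 (here a stand-in for the
# output of the representation function h), then slide it forward as new
# latent states are produced.
g = NthOrderDynamics(state_dim=64, n=4)
s0 = torch.zeros(1, 64)  # stand-in for h(observation)
window = deque([s0] * (g.n + 1), maxlen=g.n + 1)
reward, s_next = g(torch.stack(list(window), dim=1))
window.append(s_next)  # oldest state drops out automatically
```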

Finally, AMPED uses an empirical advantage estimate during the MCTS backup phase. This advantage is calculated from the predicted Q-value and the predicted value (from the prediction function $f$), rather than from the Q-value alone as in MuZero. We refer the reader to [5] for details on the MCTS backup phase.
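
A hedged sketch of what an advantage-based backup could look like, where $A = Q - V$ is formed in place of $Q$ alone; the node fields (value_sum, visit_count, reward, predicted_value) and the discount value are assumptions for illustration, not the repository's data structures:

```python
def backup(search_path, leaf_value, discount=0.997):
    """Propagate a leaf evaluation up the search path, forming advantages.

    search_path: nodes from root to leaf. Each node is assumed to carry
    value_sum, visit_count, reward, and predicted_value (V from the
    prediction function f); all of these fields are illustrative.
    """
    g = leaf_value
    for node in reversed(search_path):
        node.value_sum += g
        node.visit_count += 1
        q = node.value_sum / node.visit_count      # Q estimate for this node
        node.advantage = q - node.predicted_value  # A = Q - V, used in place
                                                   # of Q alone as in MuZero
        g = node.reward + discount * g
```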

We find that AMPED performs at least as well as PPO and MuZero on the reinforcement learning problems we evaluated.

See the paper in this repository for more details; the repository also contains re-implementations of the basic PPO and MuZero algorithms.

About

Reinforcement learning algorithm that blends the N-th order Markov property with abstract MDPs, PPO, and a hybrid model-free/model-based approach.
