Fig.1: Reinforcement learning algorithm map
The value iteration algorithm tries to find an optimal policy by finding the optimal value function.
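As a concrete illustration, here is a minimal value-iteration sketch, assuming a small, fully known MDP whose transition model is given as `P[s][a] = [(prob, next_state, reward), ...]` (the names are illustrative, not taken from any particular library):

```python
import numpy as np

# Minimal value-iteration sketch for a small, fully known MDP (illustrative only).
# Assumed input: P[s][a] is a list of (prob, next_state, reward) transitions.
def value_iteration(P, n_states, n_actions, gamma=0.99, tol=1e-6):
    V = np.zeros(n_states)
    while True:
        Q = np.zeros((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                # Bellman optimality backup: expected reward plus discounted next-state value.
                Q[s, a] = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:   # stop when the value function is stable
            return V_new, Q.argmax(axis=1)    # optimal values and the greedy policy
        V = V_new
```

Once the value function stops changing, the greedy policy extracted from it is optimal.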
Q-learning updates the action value according to the following equation:
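In the standard tabular form (with learning rate $\alpha$ and discount factor $\gamma$) the update is

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$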
When the learning converges, the second term of the equation above approaches zero. Note that if the policy never visits some state-action pairs, the action-value function for those pairs will never be learned, and learning will not converge properly.
DQN is Q-learning with a deep neural network as the Q-function approximator. DQN learns to minimize the TD error with respect to some evaluation (loss) function.
Generally, the following error is used as the evaluation function.
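A commonly used choice is the squared TD error, in which the target is computed with a separate parameter set $\theta^{-}$:

$$L(\theta) = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta) \right)^{2} \right]$$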
It is known that learning stability degrades if the target Q value is calculated with the same parameters that are being updated. In order to stabilize the learning process, the parameter $\theta^{-}$, which is a periodic copy of $\theta$, is used instead.
When a series of strongly correlated inputs is used to train a network, the parameters of the network are updated according to the recent, similar inputs, which degrades the estimates for older inputs and prevents learning from converging. In order to reduce this sampling bias during training, experience replay was introduced. In experience replay, the obtained experience is stored in a memory (or experience buffer) and later sampled according to a uniform distribution.
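A minimal sketch of a uniform replay buffer (class and parameter names are illustrative, not from a specific library):

```python
import random
from collections import deque

class ReplayBuffer:
    """Uniform experience replay: store transitions, sample them later."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are dropped first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform sampling breaks the correlation between consecutive time steps.
        batch = random.sample(self.buffer, batch_size)
        return tuple(zip(*batch))  # (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.buffer)
```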
The idea of Double Q-learning is to reduce overestimations by decomposing the max operation in the target into action selection and action evaluation.
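In the DQN setting this leads to the Double DQN target, in which the online parameters $\theta$ select the action and the target parameters $\theta^{-}$ evaluate it:

$$y = r + \gamma\, Q\!\left(s',\ \arg\max_{a'} Q(s', a'; \theta);\ \theta^{-}\right)$$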
Google Reinforcement Learning Architecture (GORILA) is a parallelized DQN architecture.
The information within a Q function can be divided into two parts: one determined mostly by the state, and one influenced by the chosen action. The dueling network separates the Q function into Value, the part determined by the state, and Advantage, the part influenced by the action. This lets the model learn the Value-related parameters at every step regardless of which action was chosen, i.e. the model can learn faster than DQN.
Fig.2: A popular single-stream Q-network (top) and the dueling Q-network (bottom).
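The two streams are typically recombined by subtracting the mean advantage, which keeps the decomposition identifiable:

$$Q(s, a) = V(s) + \left( A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a') \right)$$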
Rainbow combines DQN with six extensions that address the limitations of the original DQN algorithm, for seven components in total (the number of colors in a rainbow!): 1. DQN, 2. DDQN, 3. Prioritized experience replay, 4. Dueling network, 5. Multi-step learning, 6. Distributional RL, and 7. Noisy network.
Fig.3: Median human-normalized performance across 57 Atari games, as a function of time.
Experience replay made it possible to sample transitions independently of the order in which they were experienced. However, the sampling was done according to a uniform distribution, and the importance of each sample was neglected. Prioritized experience replay resolves this by weighting each sample: the $i$-th sample is given a priority $p_i$, and sampling is done according to this priority.
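In the standard formulation, the probability of drawing sample $i$ is

$$P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}$$

where the exponent $\alpha$ controls how strongly the priorities are used ($\alpha = 0$ recovers uniform sampling).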
The estimation of the expected value with stochastic updates relies on those updates corresponding to the same distribution as its expectation. Prioritized experience replay introduces bias because it changes this distribution in an uncontrolled fashion.
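The standard correction is to weight each sample with an importance-sampling weight, where $N$ is the size of the replay buffer and $\beta$ is annealed towards 1 over the course of training:

$$w_i = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^{\beta}$$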
These weights can be folded into the Q-learning update by using $w_i \delta_i$ instead of $\delta_i$ (where $\delta_i$ is the TD error of sample $i$).
Multi-step learning combines Q-learning and Monte Carlo estimation to improve the quality of the value estimate. It uses the rewards over the next n steps together with the value estimate at step n to calculate the target.
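Concretely, the n-step target takes the form

$$y_t^{(n)} = \sum_{k=0}^{n-1} \gamma^{k} r_{t+k+1} + \gamma^{n} \max_{a'} Q\left(s_{t+n}, a'; \theta^{-}\right)$$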
The parameter n is quite sensitive, but for the Atari games n = 3 is said to work well.
This is another way to improve the quality of the value estimate. Distributional RL treats the return as a distribution, whose mean and variance reflect the state and action, instead of as an "expectation value", which is essentially the average over all returns.
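In this view the learned object is the return distribution $Z(s,a)$, related to the usual action value by $Q(s,a) = \mathbb{E}[Z(s,a)]$, and the Bellman equation becomes a recursion over distributions:

$$Z(s, a) \overset{D}{=} R(s, a) + \gamma Z(S', A')$$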
Fig.4: Agent differentiates action-value distributions under pressure.
Noisy network improves exploration efficiency by letting the model learn the exploration rate for itself.
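A noisy linear layer replaces the usual $y = Wx + b$ with learnable noise scales $\sigma$ applied to sampled noise $\varepsilon$ (following the common NoisyNet formulation):

$$y = \left( \mu^{w} + \sigma^{w} \odot \varepsilon^{w} \right) x + \mu^{b} + \sigma^{b} \odot \varepsilon^{b}$$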
Note that when $\sigma^{w} = \sigma^{b} = 0$, a noisy network is equivalent to a normal linear layer.
Advantage Actor-Critic (A2C) is a synchronous, deterministic variant of Asynchronous Advantage Actor Critic (A3C). It uses multiple workers to avoid the use of a replay buffer.
Hierarchical Actor Critic (HAC) learns to reach a goal state by dividing the task into short horizon intermediate goals (subgoals).
Deep Deterministic Policy Gradient (DDPG) is an off-policy, model-free, and actor-critic algorithm.
DDPG is frequently brittle with respect to hyperparameters and other kinds of tuning. A common failure mode for DDPG is that the learned Q-function begins to dramatically overestimate Q-values, which then leads to the policy breaking, because it exploits the errors in the Q-function. Twin Delayed DDPG (TD3) is a direct successor of DDPG and improves on it with three major tricks: clipped double Q-learning, delayed policy updates, and target policy smoothing.
Clipped noise is added to each dimension of the target action. After adding the clipped noise, the target action is clipped to lie within the valid action range.
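Written out (in the usual TD3 formulation), the smoothed target action is

$$a'(s') = \operatorname{clip}\!\left( \mu_{\theta_{\mathrm{targ}}}(s') + \operatorname{clip}(\varepsilon, -c, c),\ a_{\mathrm{low}},\ a_{\mathrm{high}} \right), \qquad \varepsilon \sim \mathcal{N}(0, \sigma)$$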
Both Q-functions use a single target, calculated using whichever of the two Q-functions gives a smaller target value. Use of the smaller Q-value for the target, and regressing towards that, helps to eliminate the overestimation in the Q-function.
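Both target Q-networks are evaluated at the smoothed target action, and the smaller of the two values forms the shared target:

$$y = r + \gamma \min_{i=1,2} Q_{\phi_{i,\mathrm{targ}}}\!\left(s', a'(s')\right)$$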
Then the policy is learned just by maximizing $Q_{\phi_1}(s, \mu_{\theta}(s))$ with respect to the policy parameters $\theta$.
The policy iteration algorithm manipulates the policy directly, rather than finding it indirectly via the optimal value function.
Sarsa updates the action value according to the following equation:
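Unlike Q-learning, SARSA bootstraps from the action actually taken at the next step rather than from the maximizing action:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$$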
Trust Region Policy Optimization (TRPO)
Proximal Policy Optimization (PPO) is a policy gradient based method.
Actor-Critic using Kronecker-Factored Trust Region (ACKTR) is a policy gradient method with trust-region optimization, which reduces the computational complexity to something closer to that of a first-order method such as gradient descent.