(Repost) Prioritized Experience Replay

Schaul, Quan, Antonoglou, Silver, 2016

This post is reposted from: http://pemami4911.github.io/paper-summaries/2016/01/26/prioritizing-experience-replay.html

Summary

Uniform sampling from replay memory is not an efficient way to learn. Assigning a priority to each experience in replay memory and sampling according to that priority makes learning faster and more effective. This non-uniform sampling introduces bias, however, so weighted importance sampling is used to correct for it. Experiments on Atari games in the Arcade Learning Environment show that Double DQN with prioritized sampling significantly outperforms the previous state-of-the-art Atari results.

Evidence

Implemented Double DQN, with the main changes being the addition of prioritized experience replay sampling and importance-sampling weights
Tested on Atari games in the Arcade Learning Environment

Strengths

Lots of insight about the repercussions of this research and plenty of discussion on extensions

Notes

The magnitude of the TD-error indicates how unexpected a certain transition was
The TD-error can be a poor estimate of how much an agent can learn from a transition when rewards are noisy
Problems with greedily selecting experiences:
- High-error transitions are replayed too frequently
- Low-error transitions are almost entirely ignored
- Expensive to update entire replay memory, so errors are only updated for transitions that are replayed
- Lack of diversity leads to over-fitting
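A stochastic sampling method is introduced that strikes a balance between greedy prioritization and uniform random sampling (the method in use before this work)
The probability of sampling transition $i$ is $P (i) = \frac{p_{i}^{α}}{\sum_{k} p_{k}^{α}}$, where $p_{i} > 0$ is the priority of transition $i$ and $α$ sets how strongly priorities are used ($α = 0$ gives uniform sampling). Two variants:
- Variant 1: proportional prioritization, where $p_{i} = | δ_{i} | + ϵ$, with $δ_{i}$ the TD-error and $ϵ$ a small positive constant
- Variant 2: rank-based prioritization, with $p_{i} = \frac{1}{\mathrm{rank} (i)}$, where $\mathrm{rank} (i)$ is the position of transition $i$ when the memory is sorted by $| δ_{i} |$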
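The sampling rule and its two priority variants can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names, the example TD-errors, and the values `alpha=0.6` and `eps=1e-6` are assumptions chosen for the example, and ranks are computed with a full sort for clarity rather than efficiency.

```python
import random

def proportional_priorities(td_errors, eps=1e-6):
    """Variant 1: p_i = |delta_i| + eps (eps keeps zero-error transitions sampleable)."""
    return [abs(d) + eps for d in td_errors]

def rank_based_priorities(td_errors):
    """Variant 2: p_i = 1 / rank(i), where rank 1 is the largest |delta_i|."""
    order = sorted(range(len(td_errors)),
                   key=lambda i: abs(td_errors[i]), reverse=True)
    priorities = [0.0] * len(td_errors)
    for rank, idx in enumerate(order, start=1):
        priorities[idx] = 1.0 / rank
    return priorities

def sampling_probabilities(priorities, alpha=0.6):
    """P(i) = p_i^alpha / sum_k p_k^alpha; alpha = 0 recovers uniform sampling."""
    scaled = [p ** alpha for p in priorities]
    total = sum(scaled)
    return [s / total for s in scaled]

# Hypothetical TD-errors for four stored transitions.
td_errors = [0.5, -2.0, 0.1, 1.0]
probs_prop = sampling_probabilities(proportional_priorities(td_errors))
probs_rank = sampling_probabilities(rank_based_priorities(td_errors))

# Draw a minibatch of indices (with replacement) according to P(i).
rng = random.Random(0)
batch = rng.choices(range(len(td_errors)), weights=probs_prop, k=2)
```

In both variants the transition with the largest absolute TD-error (index 1 here) gets the highest sampling probability, but every transition keeps a non-zero chance of being replayed.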
Key insight: Estimating the expected value of the total discounted reward with stochastic updates requires that the updates follow the same distribution as the expectation. Prioritized replay introduces a bias because it changes this distribution in an uncontrolled way. This can be corrected with importance-sampling (IS) weights $w_{i} = (\frac{1}{N} \frac{1}{P (i)})^{β}$, where $N$ is the size of the replay memory
The exponent $β$ is annealed from an initial value $β_{0}$ up to 1, so the correction is complete by the end of training
IS also reduces the gradient magnitudes, which is good for optimization: it lets the algorithm follow the curvature of highly non-linear optimization landscapes, because the first-order Taylor expansion behind gradient descent is constantly re-approximated
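The importance-sampling correction and its annealing schedule can be sketched as follows. This is a hedged example under stated assumptions: the function names, the example probabilities, the starting value `beta0=0.4`, and the linear schedule length are illustrative choices, and the division by the largest weight (so that all weights are at most 1) is an added stabilization step that the notes above do not state.

```python
def annealed_beta(step, total_steps, beta0=0.4):
    """Linearly increase beta from beta0 at step 0 to 1.0 at total_steps."""
    fraction = min(1.0, step / total_steps)
    return beta0 + fraction * (1.0 - beta0)

def importance_weights(probs, beta):
    """w_i = (1/N * 1/P(i))^beta, then divided by max_i w_i (assumed normalization)."""
    n = len(probs)  # N: number of transitions in this toy replay memory
    weights = [(1.0 / (n * p)) ** beta for p in probs]
    largest = max(weights)
    return [w / largest for w in weights]

# Hypothetical sampling probabilities P(i) for four stored transitions.
probs = [0.1, 0.6, 0.1, 0.2]

# Early in training: partial correction (beta = beta0).
weights_start = importance_weights(probs, annealed_beta(0, 1000))
# End of training: full correction (beta = 1).
weights_end = importance_weights(probs, annealed_beta(1000, 1000))
```

Transitions that are sampled more often receive smaller weights. With `beta = 1` each weight is inversely proportional to `P(i)`, and with `beta = 0` all weights are equal, which amounts to no correction. In a Q-learning update each weight would scale the corresponding TD-error.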

The Blog of Xiao Wang

Associate Professor, School of Computer Science and Technology, Anhui University, Email: xiaowang@ahu.edu.cn
