What are Experience Replay and Separate Target Networks


  • Two techniques, mentioned in a paper I read recently, for dealing with an RL network that is unstable or even diverges.

    • A non-linear function approximator can be unstable or even diverge.

    • In RL, it is common to use a neural network as the function approximator.

  • These are notes from reading Human-level control through deep reinforcement learning.

Experience Replay

Goals

  • Problem:
    • ML methods generally assume the data are IID.
    • But in RL's sequential online training, consecutive samples are strongly correlated, which can cause the network to get stuck in a poor local minimum.
  • Strengths:
    • Smooths out learning and avoids oscillations or divergence in the parameters.
    • Randomizing the samples breaks these correlations and therefore reduces the variance of the updates.
    • Experience replay makes the training task look more like ordinary supervised learning, which simplifies debugging and testing the algorithm.

Method

  • The core idea is to store past experiences <s, a, r, s'> in a buffer.
  • At each update, sample uniformly at random from the buffer (a minimal sketch follows below).
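
Below is a minimal sketch of such a buffer in Python; the class name ReplayBuffer and its method names are my own choices for illustration, not taken from the paper.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of transitions <s, a, r, s'> with uniform sampling."""

    def __init__(self, capacity):
        # Oldest transitions are discarded once capacity is exceeded.
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        # Store one transition observed during online interaction.
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks the temporal correlation
        # between consecutive transitions.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```

A deque with maxlen gives the fixed-capacity behavior for free; a plain list plus a write pointer would work just as well.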

Separate Target Network

Goals and Strengths

  • To improve the stability of the method.
  • Reduces oscillations or divergence of the policy.

Method


  • The core idea: every C updates, clone the Q network to form the target network Q'; Q' is then used to generate the targets (the presumed ground-truth values).

  • The target network keeps the old parameters, while the prediction network is the one being updated.

  • Generating targets with the old parameters introduces a delay between the time Q is updated and the time that update affects the targets, making divergence or oscillations much more unlikely (see the sketch below).
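
A rough PyTorch sketch of this mechanism: targets come from the frozen clone Q', and the clone is refreshed every C updates. The function names and the placeholder value of C are mine, not from the paper.

```python
import torch

C = 10_000  # updates between clones; illustrative value, depends on the setup

def q_targets(target_net, rewards, next_states, dones, gamma=0.99):
    # Targets are produced with the *old* parameters (the clone Q'),
    # not with the network currently being optimized.
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
        return rewards + gamma * (1.0 - dones) * max_next_q

def maybe_sync(step, online_net, target_net):
    # Every C updates, clone the online Q network into Q'.
    if step % C == 0:
        target_net.load_state_dict(online_net.state_dict())
```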

Algorithm

  • Combines experience replay and a separate target network; a sketch of the full training loop is given below.
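
The following sketch puts the two pieces together into one training loop. It assumes a classic Gym-style environment (reset() returns a state, step() returns (next_state, reward, done, info)), an online q_net that maps a batch of states to Q-values, and the ReplayBuffer sketched earlier; the hyperparameter values are illustrative defaults, not the paper's settings.

```python
import copy
import random
import numpy as np
import torch
import torch.nn.functional as F

def train_dqn(env, q_net, optimizer, buffer, num_steps=100_000,
              batch_size=32, gamma=0.99, epsilon=0.1, C=10_000):
    # The target network Q' starts as a clone of the online network Q.
    target_net = copy.deepcopy(q_net)
    state = env.reset()

    for step in range(num_steps):
        # Epsilon-greedy action selection using the online network.
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                q = q_net(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
                action = int(q.argmax(dim=1).item())

        next_state, reward, done, _ = env.step(action)
        buffer.push(state, action, reward, next_state, done)
        state = env.reset() if done else next_state

        if len(buffer) >= batch_size:
            states, actions, rewards, next_states, dones = buffer.sample(batch_size)
            states = torch.as_tensor(np.asarray(states), dtype=torch.float32)
            actions = torch.as_tensor(np.asarray(actions), dtype=torch.int64)
            rewards = torch.as_tensor(np.asarray(rewards), dtype=torch.float32)
            next_states = torch.as_tensor(np.asarray(next_states), dtype=torch.float32)
            dones = torch.as_tensor(np.asarray(dones), dtype=torch.float32)

            # Targets come from the frozen clone Q'.
            with torch.no_grad():
                targets = rewards + gamma * (1.0 - dones) * \
                          target_net(next_states).max(dim=1).values

            # Q-values of the actions actually taken, from the online network.
            q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
            loss = F.smooth_l1_loss(q_values, targets)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        # Every C steps, clone Q into Q'.
        if step % C == 0:
            target_net.load_state_dict(q_net.state_dict())

    return q_net
```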

Reference

  • Mnih, V., Kavukcuoglu, K., Silver, D., et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).