Learning Policy
Theoretical Foundations
Policy Gradient:
\[R_\theta = \sum_\tau \mathrm{reward}(\tau)\, p_\theta(\tau)
\\
\nabla R_\theta = \sum_\tau \mathrm{reward}(\tau)\,
\nabla p_\theta(\tau) \\
= \sum_\tau \mathrm{reward}(\tau)\,
p_\theta(\tau) \frac{\nabla p_\theta(\tau)}
{p_\theta(\tau)} \\
= \sum_\tau \mathrm{reward}(\tau)\, p_\theta(\tau)\,
\nabla \log p_\theta(\tau) \\
= E_{\tau \sim p_\theta(\tau)}[\mathrm{reward}(\tau)\, \nabla \log p_\theta(\tau)] \\
\approx \frac1n \sum_{\tau=1}^n
\mathrm{reward}(\tau)\, \nabla \log p_\theta(\tau)\\
= \frac1n \sum_{\tau=1}^n
\mathrm{reward}(\tau)\, \nabla \log
\bigl(p(s_{\tau 1})\, p_\theta(a_{\tau 1}|s_{\tau 1})\, p(s_{\tau 2}|s_{\tau 1},a_{\tau 1})\, p_\theta(a_{\tau 2}|s_{\tau 2}) \cdots\bigr)\\
= \frac1n \sum_{\tau=1}^n \mathrm{reward}(\tau) \sum_{t=1}^{T_\tau}
\nabla \log p_{\theta}(a_{\tau t}|s_{\tau t}) \\
= \frac1n \sum_{\tau=1}^n \sum_{t=1}^{T_\tau} \mathrm{reward}(\tau)\,
\nabla \log p_{\theta}(a_{\tau t}|s_{\tau t}) \\
\]
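The final Monte-Carlo estimator above can be sketched numerically. A minimal NumPy sketch, assuming a single-state (bandit-style) softmax policy over two actions so that $$p_\theta(a_{\tau t}|s_{\tau t})$$ reduces to a softmax over the parameters; the function names and toy trajectories are hypothetical, not from the text:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

def grad_log_softmax(theta, a):
    # d/d(theta) log softmax(theta)[a] = one_hot(a) - softmax(theta)
    p = softmax(theta)
    one_hot = np.zeros_like(theta)
    one_hot[a] = 1.0
    return one_hot - p

def reinforce_gradient(theta, trajectories):
    """Estimate grad R = (1/n) sum_tau reward(tau) sum_t grad log p(a_t)."""
    grad = np.zeros_like(theta)
    for actions, reward in trajectories:
        for a in actions:
            grad += reward * grad_log_softmax(theta, a)
    return grad / len(trajectories)

theta = np.zeros(2)
# two sampled trajectories: (actions taken, total trajectory reward)
trajs = [([0, 0], 1.0), ([1], -1.0)]
g = reinforce_gradient(theta, trajs)  # ascent direction favoring action 0
```

Note the whole-trajectory reward multiplies every per-step log-probability gradient, exactly as in the last line of the derivation.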
Proximal Policy Optimization:
\[\nabla R_\theta = \sum_\tau \mathrm{reward}(\tau)\, p_\theta(\tau)\,
\nabla \log p_\theta(\tau) \\
= \sum_\tau \mathrm{reward}(\tau)\, p_{\theta'}(\tau) \frac{p_{\theta}(\tau)}{p_{\theta'}(\tau)}
\nabla \log p_\theta(\tau) \\
= E_{\tau \sim p_{\theta'}(\tau)}\left[\frac{p_{\theta}(\tau)}{p_{\theta'}(\tau)}
\mathrm{reward}(\tau)\, \nabla \log p_\theta(\tau)\right] \\
\approx \frac1n \sum_{\tau=1}^n
\frac{p_{\theta}(\tau)}{p_{\theta'}(\tau)}
\mathrm{reward}(\tau)\, \nabla \log p_\theta(\tau),\ \text{sampling with } \theta'\\
= \frac1n \sum_{\tau=1}^n
\frac{
p(s_{\tau 1})\, p_\theta(a_{\tau 1}|s_{\tau 1})\, p(s_{\tau 2}|s_{\tau 1},a_{\tau 1})\, p_\theta(a_{\tau 2}|s_{\tau 2}) \cdots
}{
p(s_{\tau 1})\, p_{\theta'}(a_{\tau 1}|s_{\tau 1})\, p(s_{\tau 2}|s_{\tau 1},a_{\tau 1})\, p_{\theta'}(a_{\tau 2}|s_{\tau 2}) \cdots
}
\mathrm{reward}(\tau) \sum_{t=1}^{T_\tau}
\nabla \log p_{\theta}(a_{\tau t}|s_{\tau t}) \\
\approx
\frac1n \sum_{\tau=1}^n \sum_{t=1}^{T_\tau}
\frac{p_{\theta}(a_{\tau t}|s_{\tau t})}{p_{\theta'}(a_{\tau t}|s_{\tau t})}
\mathrm{reward}(\tau)\,
\nabla \log p_\theta(a_{\tau t}|s_{\tau t})
\\
\text{clip at each step: } \min\left(
\begin{matrix}
\mathrm{reward}(\tau)\frac{p_{\theta}(a_{\tau t}|s_{\tau t})}{p_{\theta'}(a_{\tau t}|s_{\tau t})} \\
\mathrm{reward}(\tau)\,\mathrm{clip}\left(
\frac{p_{\theta}(a_{\tau t}|s_{\tau t})}{p_{\theta'}(a_{\tau t}|s_{\tau t})},
1-\epsilon,1+\epsilon\right) \\
\end{matrix}
\right)
\]
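The per-step clipped term can be illustrated with concrete numbers. A minimal sketch (function name hypothetical), assuming a scalar importance ratio and $$\epsilon = 0.2$$:

```python
import numpy as np

def clipped_term(ratio, reward, eps=0.2):
    """Per-step PPO term: min(reward * ratio, reward * clip(ratio, 1-eps, 1+eps))."""
    return np.minimum(reward * ratio, reward * np.clip(ratio, 1 - eps, 1 + eps))

# Positive reward, ratio above 1 + eps: the clipped branch caps the term at 1.2.
capped = clipped_term(1.5, 1.0)
# Negative reward, same ratio: min() keeps the worse (more negative) unclipped value.
pessimistic = clipped_term(1.5, -1.0)
# Ratio inside [1 - eps, 1 + eps]: clipping changes nothing.
inside = clipped_term(0.9, 1.0)
```

The outer min makes the bound one-sided: the objective is never allowed to look better than the clipped value, but it can look worse, which discourages $$\theta$$ from drifting far from the sampling policy $$\theta'$$.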
Theoretical Optimization
In essence, we are training an agent to make the optimal decision in a given state. "Optimal" here can focus purely on long-term advantage, or be a trade-off between long-term and short-term advantage. The supervision signal is the advantage function A(s, a) built from environment feedback, i.e., the long-term advantage of taking a particular action in the current state s. From this we can set up the objective:
\[R_\theta = \sum_{s, a}A(s, a)p_\theta(s,a)
\]
In the formula above, we want R to be as large as possible, i.e., we encourage the agent, in a given state, to make with high probability the wise decision that brings long-term advantage.
\[\nabla R_\theta = \sum_{s,a}A(s,a)\, \nabla p_\theta(s,a) \\
= \sum_{s,a}A(s,a)\, p_\theta(s,a)\, \nabla \log p_\theta(s,a) \\
= E_{s,a \sim p_\theta(s,a)}[A(s,a)\, \nabla \log p_\theta(s,a)] \\
\approx \frac1{samples} \sum_{s,a} A(s,a)\, \nabla \log(p_\theta(s)\, p_\theta(a|s))\\
\approx \frac1{samples} \sum_{s,a} A(s,a)\, \nabla \log p_\theta(a|s)\\
\]
In the formula above, $$p(s,a)=p_\theta(s)p_\theta(a|s)$$, where $$p_\theta(s)$$ is influenced by the agent's previous decision; since taking the log turns the multiplicative effect on the gradient into an additive one, the gradient is dominated by $$p_\theta(a|s)$$. Next, we turn the formula above into an off-policy procedure:
\[\nabla R_\theta = \sum_{s,a}A(s,a)\, \nabla p_\theta(s,a) \\
= \sum_{s,a}A(s,a)\, p_\theta(s,a)\, \nabla \log p_\theta(s,a) \\
= \sum_{s,a}A(s,a)\, p_{\theta'}(s,a) \frac{p_{\theta}(s,a)}{p_{\theta'}(s,a)} \nabla \log p_\theta(s,a) \\
= E_{s,a \sim p_{\theta'}(s,a)}\left[\frac{p_{\theta}(s,a)}{p_{\theta'}(s,a)} A(s,a)\, \nabla \log p_\theta(s,a)\right] \\
\approx \frac1{samples} \sum_{s,a}
\frac{p_\theta(s)\, p_{\theta}(a|s)}{p_{\theta'}(s)\, p_{\theta'}(a|s)}
A(s,a)\, \nabla \log(p_\theta(s)\, p_\theta(a|s))\\
\approx \frac1{samples} \sum_{s,a}
\min\left(
\begin{matrix}
\frac{p_{\theta}(a|s)}{p_{\theta'}(a|s)} A(s,a) \\
\mathrm{clip}\left(
\frac{p_{\theta}(a|s)}{p_{\theta'}(a|s)}, 1-\epsilon,1+\epsilon\right) A(s,a) \\
\end{matrix}
\right)
\nabla \log p_\theta(a|s)\\
\]
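Putting the advantage and the clip together, the sampled surrogate objective can be sketched as follows. This is a toy illustration with hypothetical names, not a full PPO implementation; in practice the ratio would come from the two policies' per-action probabilities:

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Mean over samples of min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    return np.minimum(unclipped, clipped).mean()

# Three sampled (s, a) pairs: per-action importance ratios and advantage estimates.
ratio = np.array([0.9, 1.5, 0.5])
advantage = np.array([1.0, 1.0, -1.0])
obj = ppo_clip_objective(ratio, advantage)
```

Training maximizes this objective (or minimizes its negative with autodiff); the min keeps the surrogate pessimistic whenever the ratio drifts outside $$[1-\epsilon, 1+\epsilon]$$ in the direction the advantage favors.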
The meaning of $$\frac{p_{\theta}(s,a)}{p_{\theta'}(s,a)}$$ is: I watch another agent's actions to adjust my own policy. If the other policy is more likely than mine to produce a given decision, the weight of that sample is reduced; conversely, if I am more likely to produce it, the weight is increased. The approximation above rests on the two policies not differing too much, and the clip on the size of the multiplicative gradient factor is what constrains that difference.