GAE & reward shaping
Policy gradient algorithms such as TRPO and PPO are popular on-policy methods. They provide unbiased (or nearly unbiased) gradient estimates, but at the cost of high variance. Off-policy methods such as Q-learning and off-policy actor-critic algorithms (e.g., DDPG) can instead reuse samples generated by other behavior policies, which greatly improves sample efficiency; however, with nonlinear function approximation they are not guaranteed to converge.
"介于回合更新与单步更新之间的算法"
GAE can be viewed either as using the critic's value estimates (which may be learned off-policy) to reduce the variance of the policy gradient, or as using on-policy Monte Carlo returns to correct the bias introduced by the critic. For \(0<\lambda<1\), the generalized advantage estimator trades off bias against variance, controlled by the parameter \(\lambda\): \(\lambda=0\) gives low variance but high bias, while \(\lambda=1\) gives high variance but low bias.
1. TD($\lambda$)
Monte Carlo methods must run a complete episode and use all of the actually observed rewards to update the value estimate, whereas Temporal Difference (TD) methods estimate the value function using only the reward sampled at the current step. A natural compromise is to estimate with the rewards of the next n steps (the n-step return).
Definition of the n-step return: \(R_t^{(n)} \doteq r_{t+1}+\gamma r_{t+2}+\cdots+\gamma^{n-1} r_{t+n}+\gamma^{n} \hat{v}\left(S_{t+n}, \mathrm{w}_{t+n-1}\right), \quad 0 \leq t \leq T-n\)
Weighting the n-step returns by \((1-\lambda)\lambda^{n-1}\) gives the \(\lambda\)-return: \(R_{t}^{\lambda} \doteq(1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}\)
The first term is the 1-step return (the ordinary TD target) with weight \(1-\lambda\), the second is the 2-step return with weight \((1-\lambda)\lambda\), and so on; the final Monte Carlo return receives the remaining weight \(\lambda^{T-t-1}\), where \(T\) is the episode length. The weights decay as \(n\) grows and sum to 1, because \(\sum_{n=1}^{\infty} \lambda^{n-1}=\sum_{n=0}^{\infty} \lambda^{n}=\frac{1}{1-\lambda}\). A code sketch of this weighting follows.
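To make the weighting concrete, here is a minimal Python sketch (the helper names `n_step_return` and `lambda_return` are ours, not from any library) that computes \(R_t^{(n)}\) and the \(\lambda\)-return for a finite episode; the tail weight \(\lambda^{T-t-1}\) is placed on the full Monte Carlo return so the weights still sum to 1.

```python
def n_step_return(rewards, values, t, n, gamma):
    """R_t^{(n)}: n-step discounted reward plus a bootstrapped value.

    rewards[i] is the reward received after the action at step i,
    values[i] is the estimate v_hat(S_i); values has length
    len(rewards) + 1 with values[-1] = 0 at the terminal state.
    """
    T = len(rewards)
    n = min(n, T - t)                       # truncate at the episode end
    ret = sum(gamma**k * rewards[t + k] for k in range(n))
    return ret + gamma**n * values[t + n]

def lambda_return(rewards, values, t, gamma, lam):
    """R_t^{lambda} = (1 - lam) * sum_n lam^(n-1) * R_t^{(n)},
    with the remaining weight lam^(T-t-1) on the Monte Carlo return."""
    T = len(rewards)
    total = 0.0
    for n in range(1, T - t):               # n-step returns that bootstrap
        total += (1 - lam) * lam**(n - 1) * n_step_return(rewards, values, t, n, gamma)
    # the full-episode (Monte Carlo) return gets the tail weight
    total += lam**(T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return total

# tiny usage example
rewards = [1.0, 0.0, 2.0, 1.0]
values = [0.5, 0.4, 0.6, 0.3, 0.0]          # values[-1] = 0 (terminal)
print(lambda_return(rewards, values, t=0, gamma=0.99, lam=0.9))
```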
2. $\gamma$-just
Definition 1. The estimator \(\hat{A}_{t}\) is \(\gamma\)-just if
$$
\mathbb{E}_{s_{0: \infty}, a_{0: \infty}}\left[\hat{A}_{t}\left(s_{0: \infty}, a_{0: \infty}\right) \nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right)\right]=\mathbb{E}_{s_{0: \infty}, a_{0: \infty}}\left[A^{\pi, \gamma}\left(s_{t}, a_{t}\right) \nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right)\right]
$$
Proposition 1. Suppose that \(\hat{A}_{t}\) can be written in the form
$$
\hat{A}_{t}\left(s_{0: \infty}, a_{0: \infty}\right)=Q_{t}\left(s_{t: \infty}, a_{t: \infty}\right)-b_{t}\left(s_{0: t}, a_{0: t-1}\right)
$$
such that for all \(\left(s_{t}, a_{t}\right)\),
$$
\mathbb{E}_{s_{t+1: \infty}, a_{t+1: \infty} | s_{t}, a_{t}}\left[Q_{t}\left(s_{t: \infty}, a_{t: \infty}\right)\right]=Q^{\pi, \gamma}\left(s_{t}, a_{t}\right).
$$
Then \(\hat{A}_{t}\) is \(\gamma\)-just.
Proof. Split the expectation into a term involving \(Q_{t}\) and a term involving \(b_{t}\):
$$
\begin{aligned}
&\mathbb{E}_{s_{0: \infty}, a_{0: \infty}}\left[\nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right)\left(Q_{t}\left(s_{0: \infty}, a_{0: \infty}\right)-b_{t}\left(s_{0: t}, a_{0: t-1}\right)\right)\right] \\
&\quad=\mathbb{E}_{s_{0: \infty}, a_{0: \infty}}\left[\nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right) Q_{t}\left(s_{0: \infty}, a_{0: \infty}\right)\right]-\mathbb{E}_{s_{0: \infty}, a_{0: \infty}}\left[\nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right) b_{t}\left(s_{0: t}, a_{0: t-1}\right)\right]
\end{aligned}
$$
For the \(Q_{t}\) term, condition on \(\left(s_{0: t}, a_{0: t}\right)\) and use the assumption of Proposition 1:
$$
\begin{aligned}
\mathbb{E}_{s_{0: \infty}, a_{0: \infty}}&\left[\nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right) Q_{t}\left(s_{0: \infty}, a_{0: \infty}\right)\right] \\
&=\mathbb{E}_{s_{0: t}, a_{0: t}}\left[\mathbb{E}_{s_{t+1: \infty}, a_{t+1: \infty}}\left[\nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right) Q_{t}\left(s_{0: \infty}, a_{0: \infty}\right)\right]\right] \\
&=\mathbb{E}_{s_{0: t}, a_{0: t}}\left[\nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right) \mathbb{E}_{s_{t+1: \infty}, a_{t+1: \infty}}\left[Q_{t}\left(s_{0: \infty}, a_{0: \infty}\right)\right]\right] \\
&=\mathbb{E}_{s_{0: t}, a_{0: t}}\left[\nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right) Q^{\pi, \gamma}\left(s_{t}, a_{t}\right)\right]
\end{aligned}
$$
Since \(V^{\pi, \gamma}\left(s_{t}\right)\) does not depend on \(a_{t}\), this equals \(\mathbb{E}_{s_{0: t}, a_{0: t}}\left[\nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right) A^{\pi, \gamma}\left(s_{t}, a_{t}\right)\right]\). For the \(b_{t}\) term, condition on \(\left(s_{0: t}, a_{0: t-1}\right)\), which determines \(b_{t}\):
$$
\begin{aligned}
\mathbb{E}_{s_{0: \infty}, a_{0: \infty}}&\left[\nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right) b_{t}\left(s_{0: t}, a_{0: t-1}\right)\right] \\
&=\mathbb{E}_{s_{0: t}, a_{0: t-1}}\left[\mathbb{E}_{s_{t+1: \infty}, a_{t: \infty}}\left[\nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right) b_{t}\left(s_{0: t}, a_{0: t-1}\right)\right]\right] \\
&=\mathbb{E}_{s_{0: t}, a_{0: t-1}}\left[\mathbb{E}_{s_{t+1: \infty}, a_{t: \infty}}\left[\nabla_{\theta} \log \pi_{\theta}\left(a_{t} | s_{t}\right)\right] b_{t}\left(s_{0: t}, a_{0: t-1}\right)\right] \\
&=\mathbb{E}_{s_{0: t}, a_{0: t-1}}\left[0 \cdot b_{t}\left(s_{0: t}, a_{0: t-1}\right)\right] \\
&=0
\end{aligned}
$$
Subtracting the two terms gives exactly the condition in Definition 1, so \(\hat{A}_{t}\) is \(\gamma\)-just.
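The crux of the proof is that \(\mathbb{E}_{a_t \sim \pi_\theta}\left[\nabla_{\theta} \log \pi_{\theta}(a_t | s_t)\right]=0\), so any baseline \(b_t\) that does not depend on \(a_t\) contributes nothing to the gradient. A quick numerical sanity check of this fact for a hypothetical softmax policy over four discrete actions (all names here are ours, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

# hypothetical softmax policy: pi_theta(a) = softmax(theta)[a]
theta = rng.normal(size=4)
probs = softmax(theta)

def grad_log_pi(a):
    """grad_theta log pi_theta(a) = one_hot(a) - softmax(theta)."""
    g = -probs.copy()
    g[a] += 1.0
    return g

# E_{a ~ pi}[grad log pi(a)] estimated by sampling: should be ~0,
# hence E[grad log pi(a) * b] = 0 for any action-independent baseline b.
samples = rng.choice(4, size=200_000, p=probs)
estimate = np.mean([grad_log_pi(a) for a in samples], axis=0)
print(estimate)   # close to the zero vector

# exact check: sum_a pi(a) * grad log pi(a) = 0
exact = sum(probs[a] * grad_log_pi(a) for a in range(4))
print(exact)      # numerically zero
```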
Now let \(V\) be an approximate value function and define the TD residual \(\delta_{t}^{V}=r_{t}+\gamma V\left(s_{t+1}\right)-V\left(s_{t}\right)\). If \(V=V^{\pi, \gamma}\), then \(\delta_{t}^{V}\) is a \(\gamma\)-just estimator of the advantage, since
$$
\mathbb{E}_{s_{t+1}}\left[\delta_{t}^{V^{\pi, \gamma}}\right]=\mathbb{E}_{s_{t+1}}\left[r_{t}+\gamma V^{\pi, \gamma}\left(s_{t+1}\right)-V^{\pi, \gamma}\left(s_{t}\right)\right]=A^{\pi, \gamma}\left(s_{t}, a_{t}\right)
$$
Summing \(k\) of these TD residuals gives the \(k\)-step advantage estimators:
$$
\begin{aligned}
\hat{A}_{t}^{(1)} &:=\delta_{t}^{V} &&=-V\left(s_{t}\right)+r_{t}+\gamma V\left(s_{t+1}\right) \\
\hat{A}_{t}^{(2)} &:=\delta_{t}^{V}+\gamma \delta_{t+1}^{V} &&=-V\left(s_{t}\right)+r_{t}+\gamma r_{t+1}+\gamma^{2} V\left(s_{t+2}\right) \\
\hat{A}_{t}^{(3)} &:=\delta_{t}^{V}+\gamma \delta_{t+1}^{V}+\gamma^{2} \delta_{t+2}^{V} &&=-V\left(s_{t}\right)+r_{t}+\gamma r_{t+1}+\gamma^{2} r_{t+2}+\gamma^{3} V\left(s_{t+3}\right) \\
\hat{A}_{t}^{(k)} &:=\sum_{l=0}^{k-1} \gamma^{l} \delta_{t+l}^{V} &&=-V\left(s_{t}\right)+r_{t}+\gamma r_{t+1}+\cdots+\gamma^{k-1} r_{t+k-1}+\gamma^{k} V\left(s_{t+k}\right)
\end{aligned}
\tag{11-14}
$$
Letting \(k \rightarrow \infty\) recovers the Monte Carlo advantage estimate:
$$
\hat{A}_{t}^{(\infty)}=\sum_{l=0}^{\infty} \gamma^{l} \delta_{t+l}^{V}=-V\left(s_{t}\right)+\sum_{l=0}^{\infty} \gamma^{l} r_{t+l}
\tag{15}
$$
The generalized advantage estimator \(\mathrm{GAE}(\gamma, \lambda)\) is the exponentially weighted average of the \(k\)-step estimators:
$$
\begin{aligned}
\hat{A}_{t}^{\mathrm{GAE}(\gamma, \lambda)} &:=(1-\lambda)\left(\hat{A}_{t}^{(1)}+\lambda \hat{A}_{t}^{(2)}+\lambda^{2} \hat{A}_{t}^{(3)}+\ldots\right) \\
&=(1-\lambda)\left(\delta_{t}^{V}+\lambda\left(\delta_{t}^{V}+\gamma \delta_{t+1}^{V}\right)+\lambda^{2}\left(\delta_{t}^{V}+\gamma \delta_{t+1}^{V}+\gamma^{2} \delta_{t+2}^{V}\right)+\ldots\right) \\
&=(1-\lambda)\left(\delta_{t}^{V}\left(1+\lambda+\lambda^{2}+\ldots\right)+\gamma \delta_{t+1}^{V}\left(\lambda+\lambda^{2}+\lambda^{3}+\ldots\right)+\gamma^{2} \delta_{t+2}^{V}\left(\lambda^{2}+\lambda^{3}+\lambda^{4}+\ldots\right)+\ldots\right) \\
&=(1-\lambda)\left(\delta_{t}^{V}\left(\frac{1}{1-\lambda}\right)+\gamma \delta_{t+1}^{V}\left(\frac{\lambda}{1-\lambda}\right)+\gamma^{2} \delta_{t+2}^{V}\left(\frac{\lambda^{2}}{1-\lambda}\right)+\ldots\right) \\
&=\sum_{l=0}^{\infty}(\gamma \lambda)^{l} \delta_{t+l}^{V}
\end{aligned}
\tag{16}
$$
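The last line of (16) suggests the standard implementation: compute the TD residuals \(\delta_t^V\) and accumulate them backwards through the trajectory with factor \(\gamma\lambda\). A minimal sketch for a single finite episode follows (function and variable names are ours; a production implementation would also bootstrap at truncated rollouts). Setting \(\lambda=0\) reduces it to the one-step residual \(\delta_t^V\), and \(\lambda=1\) to the discounted return minus \(V(s_t)\), matching the bias and variance trade-off described earlier.

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one episode.

    rewards: array of r_t, length T
    values:  array of V(s_t), length T + 1 (values[T] is the value of the
             state after the last step; 0 if the episode terminated)
    Returns A_t^{GAE(gamma, lam)} = sum_l (gamma * lam)^l * delta_{t+l}^V.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.asarray(values, dtype=np.float64)
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # delta_t^V
        gae = delta + gamma * lam * gae                          # recursive accumulation
        advantages[t] = gae
    return advantages

# tiny usage example
rewards = [1.0, 0.0, 2.0, 1.0]
values = [0.5, 0.4, 0.6, 0.3, 0.0]
print(compute_gae(rewards, values))
```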
Reward shaping adds a shaping term \(F\left(s, a, s^{\prime}\right)\) to the original reward, defining a transformed MDP \(M^{\prime}\) with reward \(r^{\prime}\left(s, a, s^{\prime}\right)=r\left(s, a, s^{\prime}\right)+F\left(s, a, s^{\prime}\right)\). Its Q-function satisfies the Bellman equation
$$
\begin{aligned}
\hat{Q}_{M^{\prime}}(s, a) &=\mathbb{E}_{s^{\prime} \sim P_{s a}}\left[r\left(s, a, s^{\prime}\right)+F\left(s, a, s^{\prime}\right)+\gamma \max_{a^{\prime} \in A} \hat{Q}_{M^{\prime}}\left(s^{\prime}, a^{\prime}\right)\right] \\
&=\mathbb{E}_{s^{\prime} \sim P_{s a}}\left[r^{\prime}\left(s, a, s^{\prime}\right)+\gamma \max_{a^{\prime} \in A} \hat{Q}_{M^{\prime}}\left(s^{\prime}, a^{\prime}\right)\right]
\end{aligned}
$$
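As a concrete (hypothetical) illustration, \(F\) is often chosen to be potential-based, \(F\left(s, a, s^{\prime}\right)=\gamma \Phi\left(s^{\prime}\right)-\Phi(s)\), the classic choice that preserves the optimal policy; the sketch below simply rewrites a trajectory's rewards as \(r^{\prime}=r+F\) under an assumed potential function `potential` (all names here are ours):

```python
GAMMA = 0.99

def potential(state):
    """Hypothetical potential function Phi(s); here, the negative
    distance to an assumed goal state at 10.0."""
    goal = 10.0
    return -abs(goal - state)

def shaped_rewards(states, rewards):
    """Return r'(s, a, s') = r(s, a, s') + F(s, a, s') with the
    potential-based shaping term F(s, a, s') = GAMMA * Phi(s') - Phi(s).

    states has length len(rewards) + 1: the i-th action moves
    states[i] -> states[i + 1] and yields rewards[i]."""
    shaped = []
    for i, r in enumerate(rewards):
        s, s_next = states[i], states[i + 1]
        F = GAMMA * potential(s_next) - potential(s)
        shaped.append(r + F)
    return shaped

# toy 1-D chain: the agent moves from state 0 toward the goal at 10
states = [0.0, 1.0, 2.0, 4.0, 7.0]
rewards = [0.0, 0.0, 0.0, 1.0]
print(shaped_rewards(states, rewards))
```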