Reinforcement Learning: An Introduction (second edition) - Chapter 11,12,13
Contents
Chapter 11
11.1
Convert the equation of n-step off-policy TD (7.9) to semi-gradient form. Give accompanying definitions of the return for both the episodic and continuing cases.
- Converting (7.9), \(V_{t+n}(S_t)=V_{t+n-1}(S_t)+\alpha\rho_{t:t+n-1}[G_{t:t+n}-V_{t+n-1}(S_t)]\), to semi-gradient form gives
\[\textbf{w}_{t+n} \overset{.}{=} \textbf{w}_{t+n-1} + \alpha \rho_{t}\rho_{t+1} \cdots \rho_{t+n-1}\,\delta_{t+n-1} \nabla \hat v(S_t,\textbf{w}_{t+n-1})
\]
For the episodic case,
\[\delta_{t+n-1} \overset{.}{=} R_{t+1} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n\hat v(S_{t+n},\textbf{w}_{t+n-1})-\hat v(S_t,\textbf{w}_{t+n-1})
\]
and for the continuing case,
\[\delta_{t+n-1} \overset{.}{=} R_{t+1}-\bar R_t + \cdots + R_{t+n} - \bar R_{t+n-1} + \hat v(S_{t+n},\textbf{w}_{t+n-1})-\hat v(S_t,\textbf{w}_{t+n-1})
\]
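To make the continuing-case update concrete, here is a minimal Python sketch of it, assuming linear function approximation (\(\hat v(s,\textbf{w})=\textbf{w}^\mathsf{T}\textbf{x}(s)\)) and per-step stored estimates of \(\bar R\); these representational choices are mine, not part of the exercise.
```python
import numpy as np

def nstep_offpolicy_td_update(w, x, S, R, rho, r_bar, t, n, alpha):
    """One continuing-case update at time t+n; x(s) returns the feature vector of state s,
    R[k] is the reward at time k, rho[k] the importance ratio at time k, r_bar[k] the
    average-reward estimate at time k."""
    G = sum(R[t + k + 1] - r_bar[t + k] for k in range(n))   # differential reward terms
    delta = G + w @ x(S[t + n]) - w @ x(S[t])                # delta_{t+n-1}
    isr = np.prod([rho[k] for k in range(t, t + n)])         # rho_t * ... * rho_{t+n-1}
    return w + alpha * isr * delta * x(S[t])                 # gradient of a linear v_hat is x(S_t)
```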
11.2
Convert the equations of n-step \(Q(\sigma)\) (7.11 and 7.17) to semi-gradient form. Give definitions that cover both the episodic and continuing cases.
- Converting (7.11), \(Q_{t+n}(S_t,A_t) \overset{.}{=} Q_{t+n-1}(S_t,A_t)+\alpha\rho_{t+1:t+n}[G_{t:t+n}-Q_{t+n-1}(S_t,A_t)]\), to semi-gradient form gives
\[\textbf{w}_{t+n} \overset{.}{=} \textbf{w}_{t+n-1} + \alpha \rho_{t+1} \cdots \rho_{t+n-1}\,\delta_{t+n-1} \nabla \hat q(S_t,A_t,\textbf{w}_{t+n-1})
\]
Taking the return from (7.17), \(G_{t:h} \overset{.}{=} R_{t+1}+\gamma(\sigma_{t+1}\rho_{t+1}+(1-\sigma_{t+1})\pi(A_{t+1}|S_{t+1}))(G_{t+1:h}-Q_{h-1}(S_{t+1},A_{t+1}))+\gamma \bar V_{h-1}(S_{t+1})\),
for the episodic case we have
\[\delta_{t+n-1} \overset{.}{=} G_{t:h}-\hat q(S_t,A_t,\textbf{w}_{t+n-1})
\]
and for the continuing case
\[\delta_{t+n-1} \overset{.}{=} G_{t:h}-\bar R_t-\hat q(S_t,A_t,\textbf{w}_{t+n-1})
\]
Note that \(G_{t:h}\) is defined recursively in terms of itself. Although only a single \(\bar R\) is subtracted above, when writing the pseudocode every recursive evaluation of a \(G\) must subtract its own \(\bar R\); see the sketch below.
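To make that remark concrete, here is a rough sketch of how the continuing-case (differential) form of (7.17) could be computed recursively, with an \(\bar R\) subtracted at every level of the recursion. Dropping the discount factor, the array-style arguments, and the time-indexed `Q`/`V_bar` lookups are my own assumptions for illustration.
```python
def differential_sigma_return(t, h, R, r_bar, sigma, rho, pi_a, Q, V_bar):
    """Differential analogue of the recursive return (7.17).
    R[k]: reward at time k; r_bar[k]: average-reward estimate at time k;
    sigma[k], rho[k], pi_a[k]: sigma_k, rho_k, pi(A_k|S_k);
    Q[k]: current estimate of q(S_k, A_k); V_bar[k]: expected value of S_k."""
    if t == h:
        return Q[h]                                   # bootstrap at the horizon
    inner = differential_sigma_return(t + 1, h, R, r_bar, sigma, rho, pi_a, Q, V_bar)
    weight = sigma[t + 1] * rho[t + 1] + (1 - sigma[t + 1]) * pi_a[t + 1]
    # every level subtracts its own R-bar, not just the outermost one
    return (R[t + 1] - r_bar[t]) + weight * (inner - Q[t + 1]) + V_bar[t + 1]
```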
11.3
(programming) Apply one-step semi-gradient Q-learning to Baird’s counterexample and show empirically that its weights diverge.
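A minimal sketch of the experiment. The action-value features below (one copy of the Figure 11.1 state features per action) are an assumption I make, since the exercise does not prescribe how to extend the state features to state-action pairs; tracking the weight norm lets one check empirically whether the weights diverge.
```python
import numpy as np

np.random.seed(0)
GAMMA, ALPHA = 0.99, 0.01
DASHED, SOLID = 0, 1

def state_features(s):
    """8-dimensional state features from Figure 11.1 (states 0..5 upper, state 6 lower)."""
    x = np.zeros(8)
    if s < 6:
        x[s], x[7] = 2.0, 1.0
    else:
        x[6], x[7] = 1.0, 2.0
    return x

def q_features(s, a):
    """Assumed action-value features: one 8-dimensional block per action."""
    x = np.zeros(16)
    x[a * 8:(a + 1) * 8] = state_features(s)
    return x

def q_hat(s, a, w):
    return w @ q_features(s, a)

def step(s, a):
    """All rewards are zero; dashed goes to a random upper state, solid to state 6."""
    return (0.0, np.random.randint(6)) if a == DASHED else (0.0, 6)

# Initial weights as in the book's example, duplicated for the two action blocks.
w = np.array([1, 1, 1, 1, 1, 1, 10, 1] * 2, dtype=float)
s = np.random.randint(7)
for t in range(1000):
    a = DASHED if np.random.rand() < 6 / 7 else SOLID   # behaviour policy
    r, s_next = step(s, a)
    target = r + GAMMA * max(q_hat(s_next, b, w) for b in (DASHED, SOLID))
    delta = target - q_hat(s, a, w)
    w += ALPHA * delta * q_features(s, a)                # one-step semi-gradient Q-learning
    s = s_next
    if t % 100 == 0:
        print(t, np.linalg.norm(w))                      # watch whether the weight norm grows
```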
11.4
Prove (11.24). Hint: Write the \(\overline{\text{RE}}\) as an expectation over possible states \(s\) of the expectation of the squared error given that \(S_t = s\). Then add and subtract the true value of state \(s\) from the error (before squaring), grouping the subtracted true value with the return and the added true value with the estimated value. Then, if you expand the square, the most complex term will end up being zero, leaving you with (11.24).
\[\begin{array}{l}
\overline{\text{RE}}(\textbf{w})=E[(G_t-\hat v(S_t,\textbf{w}))^2]
\\ \\
\qquad \quad \ = \sum_{s \in S}\mu(s)E[(G_t-\hat v(s,\textbf{w}))^2 \mid S_t=s]
\\ \\
\qquad \quad \ = \sum_{s \in S}\mu(s)E[(G_t-v_\pi(s)+v_\pi(s)-\hat v(s,\textbf{w}))^2 \mid S_t=s]
\\ \\
\qquad \quad \ = \sum_{s \in S}\mu(s)\big(E[(G_t-v_\pi(s))^2 \mid S_t=s]+(v_\pi(s)-\hat v(s,\textbf{w}))^2+2(v_\pi(s)-\hat v(s,\textbf{w}))E[G_t-v_\pi(s) \mid S_t=s]\big)
\\ \\
\qquad \quad \ = E[(G_t-v_\pi(S_t))^2]+\overline{\text{VE}}(\textbf{w})+2\sum_{s \in S}\mu(s)(v_\pi(s)-\hat v(s,\textbf{w}))E[G_t-v_\pi(s) \mid S_t=s]
\\ \\
\qquad \quad \ = E[(G_t-v_\pi(S_t))^2]+\overline{\text{VE}}(\textbf{w})
\end{array}
\]
The cross term vanishes because \(E[G_t \mid S_t=s]=v_\pi(s)\), which leaves exactly (11.24).
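As a numeric sanity check of (11.24), here is a small Monte Carlo experiment on a made-up two-state episodic chain; the chain and the arbitrary estimate \(\textbf{w}\) are only for illustration.
```python
import numpy as np

rng = np.random.default_rng(0)

# Chain: state 0 --(+1)--> state 1 --(0 or 2, equally likely)--> terminal; gamma = 1.
v_true = np.array([2.0, 1.0])                    # true values v_pi
w = np.array([0.3, 1.7])                         # an arbitrary (tabular) estimate v_hat
mu = np.array([0.5, 0.5])                        # each state is visited once per episode

last = rng.choice([0.0, 2.0], size=200_000)      # final stochastic reward of each episode
G = np.stack([1.0 + last, last])                 # G[s] holds the sampled returns from state s

re    = np.sum(mu * np.mean((G - w[:, None]) ** 2, axis=1))
ve    = np.sum(mu * (v_true - w) ** 2)
noise = np.sum(mu * np.mean((G - v_true[:, None]) ** 2, axis=1))
print(re, ve + noise)                            # the two numbers should agree closely
```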
Chapter 12
12.1
Just as the return can be written recursively in terms of the first reward and itself one-step later (3.9), so can the \(\lambda\)-return. Derive the analogous recursive relationship from (12.2) and (12.1).
- First, \(G_t^\lambda = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G_{t:t+n}\) and \(G_{t+1}^\lambda = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G_{t+1:t+n+1}\). Also,
\[\begin{array}{l}
G_{t:t+n} = R_{t+1}+\gamma R_{t+2}+\cdots +\gamma^{n-1}R_{t+n}+\gamma^n\hat v
\\ \\
\qquad \ \ \ = R_{t+1}+\gamma (R_{t+2}+\cdots +\gamma^{n-2}R_{t+n}+\gamma^{n-1}\hat v)
\\ \\
\qquad \ \ \ = R_{t+1}+\gamma G_{t+1:t+n}
\end{array}
\]
Then
\[\begin{array}{l}
G_t^\lambda = (1-\lambda)\sum_{n=1}^{\infty}\lambda^{n-1}G_{t:t+n}
\\ \\
\quad \ \ = (1-\lambda)[G_{t:t+1}+\lambda G_{t:t+2}+\lambda^2G_{t:t+3}+\cdots]
\\ \\
\quad \ \ = (1-\lambda)[(R_{t+1}+\gamma G_{t+1:t+1})+\lambda (R_{t+1}+\gamma G_{t+1:t+2})+\lambda^2(R_{t+1}+\gamma G_{t+1:t+3})+\cdots]
\\ \\
\quad \ \ = (1-\lambda)[(R_{t+1}+\lambda R_{t+1} +\lambda^2 R_{t+1}+\cdots)+\gamma G_{t+1:t+1}+\gamma \lambda( G_{t+1:t+2}+\lambda G_{t+1:t+3}+\cdots)]
\\ \\
\quad \ \ =R_{t+1}+ \gamma \lambda(1-\lambda)(G_{t+1:t+2}+\lambda G_{t+1:t+3}+\cdots) +(1-\lambda)\gamma G_{t+1:t+1}
\\ \\
\quad \ \ =R_{t+1}+ \gamma \lambda G_{t+1}^\lambda +(1-\lambda)\gamma \hat v(S_{t+1})
\end{array}
\]
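A quick numeric check of this recursion against the episodic weighted-sum form of the \(\lambda\)-return, on a made-up episode; the random rewards and value estimates are just test data and \(\textbf{w}\) is held fixed.
```python
import numpy as np

rng = np.random.default_rng(1)
gamma, lam = 0.9, 0.8
T = 6                                        # episode length
R = rng.normal(size=T + 1)                   # R[t+1] is the reward at step t+1 (R[0] unused)
v = rng.normal(size=T + 1)                   # v[t] stands for v_hat(S_t)
v[T] = 0.0                                   # terminal state has value 0

def n_step_return(t, n):
    """G_{t:t+n}, truncated at termination."""
    h = min(t + n, T)
    G = sum(gamma ** (k - t) * R[k + 1] for k in range(t, h))
    return G + gamma ** (h - t) * v[h]

def lambda_return(t):
    """Episodic weighted-sum form: post-termination terms collapse onto the full return."""
    G = (1 - lam) * sum(lam ** (n - 1) * n_step_return(t, n) for n in range(1, T - t))
    return G + lam ** (T - t - 1) * n_step_return(t, T - t)

for t in range(T - 1):
    recursive = R[t + 1] + gamma * (lam * lambda_return(t + 1) + (1 - lam) * v[t + 1])
    print(abs(lambda_return(t) - recursive))   # should all be ~0
```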
12.2
The parameter \(\lambda\) characterizes how fast the exponential weighting in Figure 12.2 falls off, and thus how far into the future the \(\lambda\)-return algorithm looks in determining its update. But a rate factor such as \(\lambda\) is sometimes an awkward way of characterizing the speed of the decay. For some purposes it is better to specify a time constant, or half-life. What is the equation relating \(\lambda\) and the half-life, \(\tau_\lambda\), the time by which the weighting sequence will have fallen to half of its initial value?
- The weight on the \(n\)-step return is \((1-\lambda)\lambda^{n-1}\), so the half-life \(\tau_\lambda\) satisfies
\[\begin{array}{l}
(1-\lambda)\lambda^{\tau_\lambda-1} = \frac{1}{2}(1-\lambda)
\\ \\
\Rightarrow \tau_{\lambda} = 1- \frac{\ln 2}{\ln \lambda}
\end{array}
\]
12.3
Some insight into how TD(\(\lambda\)) can closely approximate the offline \(\lambda\)-return algorithm can be gained by seeing that the latter’s error term (in brackets in (12.4)) can be written as the sum of TD errors (12.6) for a single fixed \(\textbf{w}\). Show this, following the pattern of (6.6), and using the recursive relationship for the \(\lambda\)-return you obtained in Exercise 12.1.
\[\begin{array}{l}
G_t^\lambda - \hat v(S_t) = R_{t+1}+ \gamma \lambda G_{t+1}^\lambda +(1-\lambda)\gamma \hat v(S_{t+1}) - \hat v(S_t) + \gamma \hat v(S_{t+1}) - \gamma\hat v(S_{t+1})
\\ \\
\qquad \qquad \quad = R_{t+1} + \gamma \hat v(S_{t+1}) - \hat v(S_t)+ \gamma \lambda G_{t+1}^\lambda -\gamma\lambda \hat v(S_{t+1})
\\ \\
\qquad \qquad \quad = \delta_t+ \gamma \lambda (G_{t+1}^\lambda - \hat v(S_{t+1}))
\\ \\
\qquad \qquad \quad = \delta_t+ \gamma \lambda (\delta_{t+1} + \gamma\lambda (G_{t+2}^\lambda - \hat v(S_{t+2})))
\\ \\
\qquad \qquad \quad \cdots
\\ \\
\qquad \qquad \quad = \sum_{k=0}^{\infty}(\gamma \lambda)^k\delta_{t+k}
\end{array}
\]
12.4
Use your result from the preceding exercise to show that, if the weight updates over an episode were computed on each step but not actually used to change the weights (\(\textbf{w}\) remained fixed), then the sum of TD(\(\lambda\))’s weight updates would be the same as the sum of the offline \(\lambda\)-return algorithm’s updates.
- The point of this exercise is that on every step we compute \(\delta\) and \(\textbf{z}\) and accumulate the increments without applying them; only after the episode ends do we perform the single update \(\textbf{w}\leftarrow\textbf{w}+\alpha\sum_t\delta_t\textbf{z}_t\). Writing out the \(\delta\)'s and \(\textbf{z}\)'s side by side:
\[\begin{array}{l}
\delta_0 | z_0=\nabla\hat v(S_0)
\\
\delta_1 | z_1= \gamma\lambda\nabla\hat v(S_0)+\nabla\hat v(S_1)
\\
\delta_2 | z_2= (\gamma\lambda)^2\nabla\hat v(S_0)+ \gamma\lambda\nabla\hat v(S_1)+\nabla\hat v(S_2)
\\
\delta_3 | z_3= (\gamma\lambda)^3\nabla\hat v(S_0)+ (\gamma\lambda)^2\nabla\hat v(S_1)+\gamma\lambda\nabla\hat v(S_2)+\nabla\hat v(S_3)
\\
\cdots
\\
\delta_n | z_n= (\gamma\lambda)^n\nabla\hat v(S_0)+ (\gamma\lambda)^{n-1}\nabla\hat v(S_1)+\cdots+\nabla\hat v(S_n)
\end{array}
\]
Observe that for a given state \(S_t\) (visited at time \(t\)), only \(z_t\) and the later traces contain \(\nabla\hat v(S_t)\), so the earlier \(\delta\)'s contribute nothing to that state's update. Collecting the terms involving \(S_t\),
\[\sum \delta z=\delta_t \nabla\hat v(S_t) +\gamma \lambda\delta_{t+1} \nabla\hat v(S_t)+ (\gamma\lambda)^{2}\delta_{t+2}\nabla\hat v(S_t)+\cdots=\Big(\sum_{k=0}^{\infty}(\gamma \lambda)^k\delta_{t+k}\Big)\nabla\hat v(S_t)=\big(G_t^\lambda - \hat v(S_t)\big)\nabla\hat v(S_t)
\]
which, multiplied by \(\alpha\), is exactly the offline \(\lambda\)-return algorithm's update for that state.
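A small numeric check that, with \(\textbf{w}\) held fixed, the accumulated TD(\(\lambda\)) increments equal the accumulated offline \(\lambda\)-return increments; the tabular three-state episode below is made up for illustration.
```python
import numpy as np

rng = np.random.default_rng(2)
gamma, lam, alpha = 0.9, 0.7, 0.1
states = [0, 1, 2, 1, 0]                 # S_0..S_4, then terminal
T = len(states)
R = rng.normal(size=T + 1)               # R[t+1] is the reward following S_t
w = rng.normal(size=3)                   # fixed tabular weights, v_hat(s) = w[s]

def v(t):
    return w[states[t]] if t < T else 0.0

def grad(t):
    """One-hot gradient of v_hat(S_t) with respect to w."""
    g = np.zeros(3)
    g[states[t]] = 1.0
    return g

delta = [R[t + 1] + gamma * v(t + 1) - v(t) for t in range(T)]

# Backward view: accumulate alpha * delta_t * z_t without applying it.
z, td_lambda_sum = np.zeros(3), np.zeros(3)
for t in range(T):
    z = gamma * lam * z + grad(t)
    td_lambda_sum += alpha * delta[t] * z

# Forward view: offline lambda-return updates, using G_t^lambda - v_hat = sum of TD errors.
offline_sum = np.zeros(3)
for t in range(T):
    lam_error = sum((gamma * lam) ** (k - t) * delta[k] for k in range(t, T))
    offline_sum += alpha * lam_error * grad(t)

print(td_lambda_sum, offline_sum)        # the two vectors should match
```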
12.5
Several times in this book (often in exercises) we have established that returns can be written as sums of TD errors if the value function is held constant. Why is (12.10) another instance of this? Prove (12.10).
- As for why \(\textbf{w}\) is allowed to change here, I am not sure; it just happens to work out. Compared with discovering the identity, proving it is much easier, so let us prove it directly in the bluntest way: expand both \(G_{t:t+k}^\lambda\) and \(\sum_{i=t}^{t+k-1}(\gamma\lambda)^{i-t}\delta_i^\prime\) and compare them term by term.
\[\begin{array}{l}
G_{t:t+k}^\lambda = (1-\lambda)G_{t:t+1}+(1-\lambda)\lambda G_{t:t+2}+(1-\lambda)\lambda^2 G_{t:t+3} + \cdots + (1-\lambda)\lambda^{k-2} G_{t:t+k-1}+\lambda^{k-1}G_{t:t+k}
\\ \\
\qquad \ \ \ = (1-\lambda)(R_{t+1}+\gamma\hat v(S_{t+1},\textbf{w}_{t}))
\\
\qquad \quad + (1-\lambda)\lambda(R_{t+1}+\gamma R_{t+2} + \gamma^2\hat v(S_{t+2},\textbf{w}_{t+1}))
\\
\qquad \quad + (1-\lambda)\lambda^2(R_{t+1}+\gamma R_{t+2} + \gamma^2R_{t+3}+\gamma^3\hat v(S_{t+3},\textbf{w}_{t+2}))
\\
\qquad \quad + \cdots
\\
\qquad \quad + (1-\lambda)\lambda^{k-2}(R_{t+1}+\gamma R_{t+2} + \cdots +\gamma^{k-2}R_{t+k-1}+\gamma^{k-1}\hat v(S_{t+k-1},\textbf{w}_{t+k-2}))
\\
\qquad \quad + \lambda^{k-1}(R_{t+1}+\gamma R_{t+2}+\cdots+\gamma^{k-1}R_{t+k}+\gamma^k\hat v(S_{t+k},\textbf{w}_{t+k-1}))
\\ \\ \\
\sum_{i=t}^{t+k-1}(\gamma\lambda)^{i-t}\delta_i^\prime = \delta_{t}^\prime + (\gamma\lambda)\delta_{t+1}^\prime + (\gamma\lambda)^2\delta_{t+2}^\prime + \cdots + (\gamma\lambda)^{k-1}\delta_{t+k-1}^\prime
\\ \\
\qquad \qquad \qquad \quad = R_{t+1} + \gamma \hat v(S_{t+1},\textbf{w}_t)-\hat v(S_t,\textbf{w}_{t-1})
\\
\qquad \qquad \qquad \quad + \gamma \lambda (R_{t+2} + \gamma \hat v(S_{t+2},\textbf{w}_{t+1})-\hat v(S_{t+1},\textbf{w}_{t}))
\\
\qquad \qquad \qquad \quad + (\gamma \lambda)^2 (R_{t+3} + \gamma \hat v(S_{t+3},\textbf{w}_{t+2})-\hat v(S_{t+2},\textbf{w}_{t+1}))
\\
\qquad \qquad \qquad \quad + \cdots
\\
\qquad \qquad \qquad \quad + (\gamma \lambda)^{k-1} (R_{t+k} + \gamma \hat v(S_{t+k},\textbf{w}_{t+k-1})-\hat v(S_{t+k-1},\textbf{w}_{t+k-2}))
\end{array}
\]
First compare the reward terms. Take \(R_{t+1}\) as an example: summing its \(\lambda\)-weighted coefficients in the first expansion gives \((1-\lambda)(1+\lambda+\cdots+\lambda^{k-2})+\lambda^{k-1}=1\), matching the single \(R_{t+1}\) in the second expansion; more generally, each \(R_{t+j}\) ends up with total coefficient \((\gamma\lambda)^{j-1}\) in both expansions. Next look at the \(\hat v\) terms: the positive bootstrap terms (coming from the \(1\) in each \(1-\lambda\)) match up and cancel between the two expansions, and so do the negative ones, leaving only the single extra term \(-\hat v(S_t,\textbf{w}_{t-1})\) in the second expansion. From this, (12.10) follows immediately.
12.6
Modify the pseudocode for Sarsa(\(\lambda\)) to use dutch traces (12.11) without the other distinctive features of a true online algorithm. Assume linear function approximation and binary features.
- (12.11) is written in terms of the feature vector; here the features are binary, i.e., 1 when active and 0 otherwise, and in the pseudocode \(\mathcal{F}(S,A)\) is the set of indices of the active features. Since the exercise says to leave out the other distinctive features of the true online algorithm, only the trace update changes: keeping the end-of-step decay \(\textbf{z}\leftarrow\gamma\lambda\textbf{z}\) in place, the dutch trace (12.11) amounts to replacing \(z_i \leftarrow z_i + 1\) with \(z_i\leftarrow z_i+1-\alpha\sum_{j\in\mathcal{F}(S,A)}z_j\) for each \(i\in\mathcal{F}(S,A)\), where the sum is computed once before looping over the active features. A sketch of the modified trace update is given below.
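A sketch of the modified trace update; only this part of the pseudocode changes, with `active_features` standing for \(\mathcal{F}(S,A)\).
```python
import numpy as np

def dutch_trace_update(z, active_features, alpha):
    """Replace `z[i] += 1` of the accumulating trace with the dutch-trace increment (12.11)."""
    correction = alpha * sum(z[i] for i in active_features)   # alpha * z^T x for binary x
    for i in active_features:
        z[i] += 1.0 - correction
    return z

# Example: three active features out of eight.
z = np.zeros(8)
z = dutch_trace_update(z, [0, 3, 5], alpha=0.1)
print(z)
```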
12.7
Generalize the three recursive equations above to their truncated versions, defining \(G_{t:h}^{\lambda s}\) and \(G_{t:h}^{\lambda a}\).
- The intent here is not entirely clear; presumably we simply replace each \(G_{t+1}^{\lambda\cdot}\) with its truncated counterpart \(G_{t+1:h}^{\lambda\cdot}\), with the recursion bottoming out at the horizon (e.g., \(G_{h:h}^{\lambda s}=\hat v(S_h,\textbf{w}_{h-1})\) and \(G_{h:h}^{\lambda a}=\hat q(S_h,A_h,\textbf{w}_{h-1})\)):
\[\begin{array}{l}
G_{t:h}^{\lambda s} = R_{t+1}+\gamma_{t+1}((1-\lambda_{t+1})\hat v(S_{t+1},\textbf{w}_t)+\lambda_{t+1}G_{t+1:h}^{\lambda s})
\\ \\
G_{t:h}^{\lambda a} = R_{t+1}+\gamma_{t+1}((1-\lambda_{t+1})\hat q(S_{t+1},A_{t+1},\textbf{w}_t)+\lambda_{t+1}G_{t+1:h}^{\lambda a})
\\ \\
G_{t:h}^{\lambda a} = R_{t+1}+\gamma_{t+1}((1-\lambda_{t+1})\overline V_t(S_{t+1})+\lambda_{t+1}G_{t+1:h}^{\lambda a})
\end{array}
\]
12.8
Prove that (12.24) becomes exact if the value function does not change. To save writing, consider the case of \(t = 0\), and use the notation \(V_k \overset{.}{=}\hat v (S_k,\textbf{w})\).
\[\begin{array}{l}
G_t^{\lambda s} - V_t = \rho_t(R_{t+1}+\gamma_{t+1}((1-\lambda_{t+1})V_{t+1}+\lambda_{t+1}G_{t+1}^{\lambda s}))+(1-\rho_t)V_t -V_t
\\ \\
\qquad \qquad = \rho_t(R_{t+1}+\gamma_{t+1} V_{t+1}-V_t +\gamma_{t+1}\lambda_{t+1}(G_{t+1}^{\lambda s}-V_{t+1}))
\\ \\
\qquad \qquad = \rho_t(\delta_t^s +\gamma_{t+1}\lambda_{t+1}(G_{t+1}^{\lambda s}-V_{t+1}))
\\ \\
\qquad \qquad = \rho_t(\delta_t^s +\gamma_{t+1}\lambda_{t+1}(\rho_{t+1}(\delta_{t+1}^s +\gamma_{t+2}\lambda_{t+2}(G_{t+2}^{\lambda s}-V_{t+2}))))
\\ \\
\qquad \qquad \cdots
\\ \\
\qquad \qquad = \rho_t\sum_{k=0}^\infty\delta_{t+k}^s\prod^{t+k}_{i=t+1}\gamma_{i}\lambda_{i}\rho_{i}
\\ \\
\Rightarrow G_t^{\lambda s}= V_t+\rho_t\sum_{k=0}^\infty\delta_{t+k}^s\prod^{t+k}_{i=t+1}\gamma_{i}\lambda_{i}\rho_{i}
\end{array}
\]
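A numeric check of this identity with the value estimates held fixed. The random finite sequences of rewards, values, \(\rho\), \(\lambda\) and \(\gamma\) (with \(\gamma_T=0\) to end the episode) are made-up test data.
```python
import numpy as np

rng = np.random.default_rng(3)
T = 8
R   = rng.normal(size=T + 1)             # R[t+1] is the reward at step t+1
V   = rng.normal(size=T + 1); V[T] = 0.0 # V[t] = v_hat(S_t), held fixed
rho = rng.uniform(0.5, 1.5, size=T + 1)
lam = rng.uniform(0.2, 0.9, size=T + 1)
gam = rng.uniform(0.5, 1.0, size=T + 1); gam[T] = 0.0

def G_lambda_s(t):
    """Recursive definition of the general off-policy lambda-return used above."""
    if t == T:
        return V[T]
    inner = (1 - lam[t + 1]) * V[t + 1] + lam[t + 1] * G_lambda_s(t + 1)
    return rho[t] * (R[t + 1] + gam[t + 1] * inner) + (1 - rho[t]) * V[t]

delta = [R[t + 1] + gam[t + 1] * V[t + 1] - V[t] for t in range(T)]   # delta_t^s

for t in range(T):
    s = sum(delta[k] * np.prod([gam[i] * lam[i] * rho[i] for i in range(t + 1, k + 1)])
            for k in range(t, T))
    print(abs(G_lambda_s(t) - (V[t] + rho[t] * s)))   # should all be ~0
```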
12.9
The truncated version of the general off-policy return is denoted \(G_{t:h}^{\lambda s}\). Guess the correct equation, based on (12.24).
- Being asked to guess is a bit cheeky; presumably we just truncate the sum at the horizon:
\[G_{t:h}^{\lambda s} \approx \hat v(S_t,\textbf{w}_t)+\rho_t\sum_{k=t}^{h-1}\delta_k^s\prod_{i=t+1}^k\gamma_i\lambda_i\rho_i
\]
12.10
Prove that (12.27) becomes exact if the value function does not change. To save writing, consider the case of \(t = 0\), and use the notation \(Q_k = \hat q(S_k,A_k,\textbf{w})\). Hint: Start by writing out \(\delta_0^a\) and \(G_0^{\lambda a}\), then \(G_0^{\lambda a}-Q_0\).
\[\begin{array}{l}
G_t^{\lambda a} - Q_t = R_{t+1}+\gamma_{t+1}(\overline V_t(S_{t+1})+\lambda_{t+1}\rho_{t+1}[G_{t+1}^{\lambda a}-Q_{t+1}])-Q_t
\\ \\
\qquad \qquad = R_{t+1}+\gamma_{t+1}\overline V_t(S_{t+1})-Q_t+\gamma_{t+1}\lambda_{t+1}\rho_{t+1}[G_{t+1}^{\lambda a}-Q_{t+1}]
\\ \\
\qquad \qquad = \delta_t^a+\gamma_{t+1}\lambda_{t+1}\rho_{t+1}[G_{t+1}^{\lambda a}-Q_{t+1}]
\\ \\
\qquad \qquad = \delta_t^a+\gamma_{t+1}\lambda_{t+1}\rho_{t+1}[\delta_{t+1}^a+\gamma_{t+2}\lambda_{t+2}\rho_{t+2}[G_{t+2}^{\lambda a}-Q_{t+2}]]
\\ \\
\qquad \qquad \cdots
\\ \\
\qquad \qquad = \sum_{k=0}^\infty\delta_{t+k}^a\prod_{i=t+1}^{t+k}\gamma_i\lambda_i\rho_i
\\ \\
\Rightarrow G_t^{\lambda a}= Q_t+\sum_{k=0}^\infty\delta_{t+k}^a\prod_{i=t+1}^{t+k}\gamma_i\lambda_i\rho_i
\end{array}
\]
12.11
The truncated version of the general off-policy return is denoted \(G_{t:h}^{\lambda a}\). Guess the correct equation for it, based on (12.27).
\[G_{t:h}^{\lambda a}\approx\hat q(S_t,A_t,\textbf{w}_t)+\sum_{k=t}^{h-1}\delta_k^a\prod_{i=t+1}^k\gamma_i\lambda_i\rho_i
\]
12.12
Show in detail the steps outlined above for deriving (12.29) from (12.27). Start with the update (12.15), substitute \(G_t^{\lambda a}\) from (12.26) for \(G_t^\lambda\), then follow similar steps as led to (12.25).
- Following the book's derivation, from (12.27) and (12.15) we have
\[\begin{array}{l}
\textbf{w}_{t+1} = \textbf{w}_t +\alpha[G_t^\lambda-\hat q(S_t,A_t,\textbf{w}_t)]\nabla\hat q(S_t,A_t,\textbf{w}_t)
\\ \\
\qquad \ \approx \textbf{w}_t + \alpha\sum_{k=t}^\infty\delta_k^a\prod_{i=t+1}^k\gamma_i\lambda_i\rho_i\nabla\hat q(S_t,A_t,\textbf{w}_t)
\end{array}
\]
Summing the updates over time,
\[\begin{array}{l}
\sum_{t=1}^\infty(\textbf{w}_{t+1} -\textbf{w}_t) \approx \sum_{t=1}^\infty\alpha\sum_{k=t}^\infty\delta_k^a\prod_{i=t+1}^k\gamma_i\lambda_i\rho_i\nabla\hat q(S_t,A_t,\textbf{w}_t)
\\ \\
\qquad \qquad \qquad \quad = \sum_{t=1}^\infty\sum_{k=t}^\infty\alpha\delta_k^a\nabla\hat q(S_t,A_t,\textbf{w}_t)\prod_{i=t+1}^k\gamma_i\lambda_i\rho_i
\\ \\
\qquad \qquad \qquad \quad = \sum_{k=1}^\infty\sum_{t=1}^k\alpha\delta_k^a\nabla\hat q(S_t,A_t,\textbf{w}_t)\prod_{i=t+1}^k\gamma_i\lambda_i\rho_i
\\ \\
\qquad \qquad \qquad \quad = \sum_{k=1}^\infty\alpha\delta_k^a\sum_{t=1}^k\nabla\hat q(S_t,A_t,\textbf{w}_t)\prod_{i=t+1}^k\gamma_i\lambda_i\rho_i
\end{array}
\]
Let \(z_k=\sum_{t=1}^k\nabla\hat q(S_t,A_t,\textbf{w}_t)\prod_{i=t+1}^k\gamma_i\lambda_i\rho_i\). Then
\[\begin{array}{l}
z_k=\sum_{t=1}^k\nabla\hat q(S_t,A_t,\textbf{w}_t)\prod_{i=t+1}^k\gamma_i\lambda_i\rho_i
\\ \\
\quad = \sum_{t=1}^{k-1}\nabla\hat q(S_t,A_t,\textbf{w}_t)\prod_{i=t+1}^k\gamma_i\lambda_i\rho_i+\nabla\hat q(S_k,A_k,\textbf{w}_k)
\\ \\
\quad = \gamma_k\lambda_k\rho_k\sum_{t=1}^{k-1}\nabla\hat q(S_t,A_t,\textbf{w}_t)\prod_{i=t+1}^{k-1}\gamma_i\lambda_i\rho_i+\nabla\hat q(S_k,A_k,\textbf{w}_k)
\\ \\
\quad = \gamma_k\lambda_k\rho_kz_{k-1}+\nabla\hat q(S_k,A_k,\textbf{w}_k)
\end{array}
\]
That is, \(z_t=\gamma_t\lambda_t\rho_tz_{t-1}+\nabla\hat q(S_t,A_t,\textbf{w}_t)\), which is (12.29).
12.13
What are the dutch-trace and replacing-trace versions of off-policy eligibility traces for state-value and action-value methods?
- I do not know how to derive the dutch trace rigorously here, so I simply write down the off-policy forms by analogy. For state values it should be
\[\textbf{z}_t = \rho_t(\gamma_t\lambda_t\textbf{z}_{t-1}+(1-\alpha\gamma_t\lambda_t\textbf{z}_{t-1}^{\mathsf{T}}\textbf{x}_t)\textbf{x}_t)
\]
and for action values it should be
\[\textbf{z}_t = \gamma_t\lambda_t\rho_t\textbf{z}_{t-1}+(1-\alpha\gamma_t\lambda_t\textbf{z}_{t-1}^{\mathsf{T}}\textbf{x}_t)\textbf{x}_t
\]
The replacing trace is the same as in the on-policy case: treat the vector \(\textbf{z}\) componentwise, with
\[z_{i,t} \overset{.}{=}
\left\{\begin{array}{ll}
1 & \text{if } x_{i,t} =1\\
\gamma\lambda z_{i,t-1} & \text{otherwise}
\end{array}\right.
\]
12.14
How might Double Expected Sarsa be extended to eligibility traces?
- For the Double version, maintain two action-value estimates \(\hat q_1(S_t,A_t,\textbf{w}_t)\) and \(\hat q_2(S_t,A_t,\textbf{w}_t)\). Each \(\delta\) bootstraps from the other estimate, and the \(\delta\) and \(\textbf{z}\) computed for one estimate are used to update it. For example, to update \(\hat q_1\) using \(\hat q_2\) for the bootstrap:
\[\begin{array}{l}
\delta_{t}^a=R_{t+1}+\gamma_{t+1}\sum_a\pi_1(a|S_{t+1})\hat q_2(S_{t+1},a,\textbf{w}_{t})-\hat q_1(S_t,A_t,\textbf{w}_t)
\\ \\
\textbf{z}_t \overset{.}{=} \gamma_t\lambda_t\pi_1(A_t|S_t)\textbf{z}_{t-1}+\nabla\hat q_1(S_t,A_t,\textbf{w}_t)
\\ \\
\textbf{w}_{t+1} \overset{.}{=} \textbf{w}_{t} +\alpha \delta_{t}^a \textbf{z}_t
\end{array}
\]
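A sketch of one update step implementing the equations above with linear function approximation. Keeping two separate weight vectors and deciding externally which estimate to update are my own assumptions, not given by the exercise.
```python
import numpy as np

def double_expected_sarsa_lambda_step(w_upd, w_other, z, x_sa, x_next_all, pi_next,
                                      R, gamma, lam, alpha, pi_sa):
    """Update w_upd, bootstrapping from w_other (the 'use q2 to update q1' case).
    x_sa: features of (S_t, A_t); x_next_all / pi_next: features and target-policy
    probabilities of the actions available in S_{t+1}; pi_sa: pi_1(A_t | S_t)."""
    expected_q = sum(p * (w_other @ x) for p, x in zip(pi_next, x_next_all))
    delta = R + gamma * expected_q - w_upd @ x_sa
    z = gamma * lam * pi_sa * z + x_sa            # trace as in the expected-Sarsa form above
    w_upd = w_upd + alpha * delta * z
    return w_upd, z

# Tiny synthetic call: 4 features, 2 actions available in the next state.
rng = np.random.default_rng(4)
w1, w2, z1 = rng.normal(size=4), rng.normal(size=4), np.zeros(4)
w1, z1 = double_expected_sarsa_lambda_step(
    w_upd=w1, w_other=w2, z=z1,
    x_sa=rng.normal(size=4),
    x_next_all=[rng.normal(size=4), rng.normal(size=4)],
    pi_next=[0.3, 0.7], R=1.0, gamma=0.99, lam=0.9, alpha=0.05, pi_sa=0.6)
print(w1)
```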
Chapter 13
13.1
Use your knowledge of the gridworld and its dynamics to determine an exact symbolic expression for the optimal probability of selecting the right action in Example 13.1.
- The goal is to maximize \(v_{\pi_{\theta}}(S)\) at the start state: write out the Bellman equations for the three states and maximize the start state's value. Number the states 1, 2, 3 from left to right (recall that in the second state the effect of the two actions is reversed), and let \(p\) be the probability of selecting the right action. Then
\[\begin{array}{l}
\left\{\begin{array}{l}
v_1 = p(-1+v_2)+(1-p)(-1+v_1)
\\
v_2=p(-1+v_1)+(1-p)(-1+v_3)
\\
v_3=p(-1+0)+(1-p)(-1+v_2)
\end{array}\right.
\\ \\
\Rightarrow
\\ \\
\left\{\begin{array}{l}
v_1 = v_2-\frac{1}{p}
\\
v_2=pv_1+(1-p)v_3-1
\\
v_3=(1-p)v_2-1
\end{array}\right.
\\ \\
\Rightarrow
\\ \\
v_1 = \frac{4-2p}{p^2-p}
\end{array}
\]
Setting the derivative to zero,
\[\begin{array}{l}
v_1^\prime = (\frac{4-2p}{p^2-p})^\prime = \frac{-2(p^2-p)-(4-2p)(2p-1)}{(p^2-p)^2}=0
\\ \\
\Rightarrow p= 2-\sqrt 2 \quad (\text{the other root } 2+\sqrt 2 \text{ lies outside } [0,1])
\end{array}
\]
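A quick numeric confirmation of the symbolic answer; the grid of \(p\) values is just for illustration.
```python
import numpy as np

p = np.linspace(0.01, 0.99, 9999)
v1 = (4 - 2 * p) / (p ** 2 - p)          # value of the start state as derived above
best = p[np.argmax(v1)]
print(best, 2 - np.sqrt(2))              # the two should be close (~0.586)
```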
13.2
Generalize the box on page 199, the policy gradient theorem (13.5), the proof of the policy gradient theorem (page 325), and the steps leading to the REINFORCE update equation (13.8), so that (13.8) ends up with a factor of \(\gamma^t\) and thus aligns with the general algorithm given in the pseudocode.
- Start with the box on page 199. Originally \(\eta(s)=h(s)+\sum_{\bar s}\eta(\bar s)\sum_a\pi(a|\bar s)p(s|\bar s,a)\), and the exercise asks us to bring \(\gamma\) into it. At first this seems odd, since \(\eta\) counts expected state visits while \(\gamma\) most naturally discounts rewards; but if we read \(\gamma\) as the probability of the episode continuing (terminating with probability \(1-\gamma\) at each step), it makes sense. We then have
\[\begin{array}{l}
\eta(s)=h(s)+\gamma\sum_{\bar s}\eta(\bar s)\sum_a\pi(a|\bar s)p(s|\bar s,a)
\\ \\
\mu(s) = \frac{\eta(s)}{\sum_{s^\prime}\eta(s^\prime)}
\end{array}
\]
Only one step of the recursion is unrolled here, hence the single factor of \(\gamma\); each further step of unrolling contributes another factor of \(\gamma\).
Next, (13.5). It appears to remain unchanged,
\[\nabla J( \theta)\propto\sum_s\mu(s)\sum_aq_\pi(s,a)\nabla\pi(a|s,\theta)
\]
because the \(\gamma\) has been absorbed into \(\eta\), so the proof goes through with essentially no change.
Now the proof on page 325. We simply carry the \(\gamma\) along while unrolling, and it is exactly absorbed when \(\eta\) is formed, giving the same result.
\[\begin{array}{l}
\nabla v_\pi(s)=\nabla[\sum_a\pi(a|s)q_\pi(s,a)]
\\ \\
\qquad \quad \ = \sum_a[\nabla\pi(a|s)q_\pi(s,a)+\pi(a|s)\nabla q_\pi(s,a)]
\\ \\
\qquad \quad \ = \sum_a[\nabla\pi(a|s)q_\pi(s,a)+\pi(a|s)\nabla \sum_{s^\prime,r}p(s^\prime,r|s,a)(r+{\color{red}\gamma} v_\pi(s^\prime))]
\\ \\
\qquad \quad \ = \sum_a[\nabla\pi(a|s)q_\pi(s,a)+\pi(a|s)\sum_{s^\prime}p(s^\prime|s,a)\gamma\nabla v_\pi(s^\prime)]
\\ \\
\qquad \quad \ = \sum_a[\nabla\pi(a|s)q_\pi(s,a)+\pi(a|s)\sum_{s^\prime}p(s^\prime|s,a)\gamma\sum_{a^\prime}[\nabla\pi(a^\prime|s^\prime)q_\pi(s^\prime,a^\prime)+\pi(a^\prime|s^\prime)\sum_{s^{\prime\prime}}p(s^{\prime\prime}|s^\prime,a^\prime)\gamma\nabla v_\pi(s^{\prime\prime})]]
\\ \\
\qquad \quad \ = \sum_{x\in \mathcal{S}}[\sum_{k=0}^{\infty}\text{Pr}(s\rightarrow x,k,\pi){\color{red}\gamma^k}]\sum_a\nabla\pi(a|x)q_\pi(x,a)
\\ \\
\qquad \quad \ = \sum_{s}\eta(s)\sum_a\nabla\pi(a|s)q_\pi(s,a)
\\ \\
\qquad \quad \ = \frac{\sum_{s^\prime}\eta(s^\prime)}{\sum_{s^\prime}\eta(s^\prime)}\sum_{s}\eta(s)\sum_a\nabla\pi(a|s)q_\pi(s,a)
\\ \\
\qquad \quad \ = \sum_{s^\prime}\eta(s^\prime)\sum_s\frac{\eta(s)}{\sum_{s^\prime}\eta(s^\prime)}\sum_a\nabla\pi(a|s)q_\pi(s,a)
\\ \\
\qquad \quad \ = \sum_{s^\prime}\eta(s^\prime)\sum_s\mu(s)\sum_a\nabla\pi(a|s)q_\pi(s,a)
\\ \\
\qquad \quad \ \propto \sum_s\mu(s)\sum_a\nabla\pi(a|s)q_\pi(s,a)
\end{array}
\]
Equation (13.8), however, does change, as does the pseudocode: a factor of \(\gamma^t\) has to appear. Keeping the definition of \(\eta\) above unchanged and pulling the \(\gamma^t\) outside, we should obtain
\[ \theta_{t+1} \overset{.}{=} \theta_{t}+\alpha \gamma^tG_t\frac{\nabla\pi(A_t|S_t, \theta_{t})}{\pi(A_t|S_t, \theta_{t})}
\]
I do not fully understand this exercise yet and need to come back to it.
13.3
In Section 13.1 we considered policy parameterizations using the soft-max in action preferences (13.2) with linear action preferences (13.3). For this parameterization, prove that the eligibility vector is
\[\nabla\ln\pi(a|s,\theta)=\textbf{x}(s,a)-\sum_b\pi(b|s,\theta)\textbf{x}(s,b)
\]
using the definitions and elementary calculus.
- From the problem statement, \(\pi(a|s, \theta) =\frac{e^{ \theta^\mathsf{T} \textbf{x}(s,a)}}{\sum_be^{ \theta^\mathsf{T} \textbf{x}(s,b)}}\). Taking the log and differentiating,
\[\begin{array}{l}
\nabla \ln\pi(a|s,\theta)=\nabla \ln \frac{e^{ \theta^\mathsf{T} \textbf{x}(s,a)}}{\sum_be^{ \theta^\mathsf{T} \textbf{x}(s,b)}}
\\ \\
\qquad \qquad \qquad =\nabla \theta^\mathsf{T} \textbf{x}(s,a) - \nabla \ln \sum_be^{\theta^\mathsf{T} \textbf{x}(s,b)}
\\ \\
\qquad \qquad \qquad = \textbf{x}(s,a) - \frac{1}{\sum_ce^{ \theta^\mathsf{T} \textbf{x}(s,c)}}\nabla \sum_be^{ \theta^\mathsf{T} \textbf{x}(s,b)}
\\ \\
\qquad \qquad \qquad = \textbf{x}(s,a) - \frac{1}{\sum_ce^{ \theta^\mathsf{T} \textbf{x}(s,c)}} \sum_b\nabla e^{ \theta^\mathsf{T} \textbf{x}(s,b)}
\\ \\
\qquad \qquad \qquad = \textbf{x}(s,a) - \frac{1}{\sum_ce^{ \theta^\mathsf{T} \textbf{x}(s,c)}} \sum_b (e^{ \theta^\mathsf{T} \textbf{x}(s,b)} \textbf{x}(s,b))
\\ \\
\qquad \qquad \qquad = \textbf{x}(s,a) - \sum_b\frac{e^{ \theta^\mathsf{T} \textbf{x}(s,b)} \textbf{x}(s,b)}{\sum_ce^{ \theta^\mathsf{T} \textbf{x}(s,c)}}
\\ \\
\qquad \qquad \qquad = \textbf{x}(s,a) - \sum_b\pi(b|s,\theta) \textbf{x}(s,b)
\end{array}
\]
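A finite-difference check of this eligibility vector on random features and parameters; the test data below are made up for illustration.
```python
import numpy as np

rng = np.random.default_rng(5)
n_actions, d = 4, 6
X = rng.normal(size=(n_actions, d))          # X[a] plays the role of x(s, a) for a fixed s
theta = rng.normal(size=d)

def log_pi(theta, a):
    prefs = X @ theta
    return prefs[a] - np.log(np.sum(np.exp(prefs)))

a = 2
pi = np.exp(X @ theta) / np.sum(np.exp(X @ theta))
analytic = X[a] - pi @ X                     # x(s,a) - sum_b pi(b|s) x(s,b)

eps = 1e-6
numeric = np.array([(log_pi(theta + eps * np.eye(d)[i], a)
                     - log_pi(theta - eps * np.eye(d)[i], a)) / (2 * eps)
                    for i in range(d)])
print(np.max(np.abs(analytic - numeric)))    # should be close to zero
```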
13.4
Show that for the gaussian policy parameterization (13.19) the eligibility vector has the following two parts:
\[\begin{array}{l}
\nabla\ln\pi(a|s,\theta_\mu)=\frac{\nabla\pi(a|s,\theta_\mu)}{\pi(a|s,\theta)}=\frac{1}{\sigma(s,\theta)^2}(a-\mu(s,\theta))\textbf{x}_\mu(s), and
\\ \\
\nabla\ln\pi(a|s,\theta_\sigma) =\frac{\nabla\pi(a|s,\theta_\sigma)}{\pi(a|s,\theta)}=(\frac{(a-\mu(s,\theta))^2}{\sigma(s,\theta)^2}-1)\textbf{x}_\sigma(s)
\end{array}
\]
- Differentiate with respect to \(\theta_\mu\) and \(\theta_\sigma\) separately.
\[\begin{array}{l}
\nabla \ln\pi(a|s,\theta_\mu)=\nabla \ln\frac{1}{\sigma(s,\theta)\sqrt{2\pi}}\exp(-\frac{(a-\mu(s,\theta))^2}{2\sigma(s,\theta)^2})
\\ \\
\qquad \qquad \qquad \ =\nabla \ln\frac{1}{\sigma(s,\theta)\sqrt{2\pi}}\exp(-\frac{(a-\theta_\mu^\mathsf{T}\textbf{x}_\mu(s))^2}{2\sigma(s,\theta)^2})
\\ \\
\qquad \qquad \qquad \ =\nabla \ln\frac{1}{\sigma(s,\theta)\sqrt{2\pi}} -\frac{\nabla(a-\theta_\mu^\mathsf{T}\textbf{x}_\mu(s))^2}{2\sigma(s,\theta)^2}
\\ \\
\qquad \qquad \qquad \ =0 -\frac{2(a-\theta_\mu^\mathsf{T}\textbf{x}_\mu(s))(-\textbf{x}_\mu(s))}{2\sigma(s,\theta)^2}
\\ \\
\qquad \qquad \qquad \ =\frac{1}{\sigma(s,\theta)^2}(a-\theta_\mu^\mathsf{T}\textbf{x}_\mu(s))\textbf{x}_\mu(s)
\\ \\
\qquad \qquad \qquad \ =\frac{1}{\sigma(s,\theta)^2}(a-\mu(s,\theta))\textbf{x}_\mu(s)
\\ \\
\nabla \ln\pi(a|s,\theta_\sigma)=\nabla \ln\frac{1}{\sigma(s,\theta)\sqrt{2\pi}}\exp(-\frac{(a-\mu(s,\theta))^2}{2\sigma(s,\theta)^2})
\\ \\
\qquad \qquad \qquad \ =\nabla \ln\frac{1}{\exp(\theta_\sigma^\mathsf{T}\textbf{x}_\sigma(s))\sqrt{2\pi}}\exp(-\frac{(a-\mu(s,\theta))^2}{2\exp(2\theta_\sigma^\mathsf{T}\textbf{x}_\sigma(s))})
\\ \\
\qquad \qquad \qquad \ =-\nabla \theta_\sigma^\mathsf{T}\textbf{x}_\sigma(s)-\nabla\frac{(a-\mu(s,\theta))^2}{2\exp(2\theta_\sigma^\mathsf{T}\textbf{x}_\sigma(s))}
\\ \\
\qquad \qquad \qquad \ =-\textbf{x}_\sigma(s)-\frac{(a-\mu(s,\theta))^2}{2}\nabla \exp(-2\theta_\sigma^\mathsf{T}\textbf{x}_\sigma(s))
\\ \\
\qquad \qquad \qquad \ =-\textbf{x}_\sigma(s)-\frac{(a-\mu(s,\theta))^2}{2}\exp(-2\theta_\sigma^\mathsf{T}\textbf{x}_\sigma(s))(-2\textbf{x}_\sigma(s))
\\ \\
\qquad \qquad \qquad \ =-\textbf{x}_\sigma(s)+\frac{(a-\mu(s,\theta))^2}{2}\exp(\theta_\sigma^\mathsf{T}\textbf{x}_\sigma(s))^{-2}(2\textbf{x}_\sigma(s))
\\ \\
\qquad \qquad \qquad \ =-\textbf{x}_\sigma(s)+\frac{(a-\mu(s,\theta))^2}{\exp(\theta_\sigma^\mathsf{T}\textbf{x}_\sigma(s))^2}\textbf{x}_\sigma(s)
\\ \\
\qquad \qquad \qquad \ =(\frac{(a-\mu(s,\theta))^2}{\sigma(s,\theta)^2}-1)\textbf{x}_\sigma(s)
\end{array}
\]
13.5
A Bernoulli-logistic unit is a stochastic neuron-like unit used in some ANNs (Section 9.6). Its input at time \(t\) is a feature vector \(\textbf{x}(S_t)\); its output, \(A_t\), is a random variable having two values, 0 and 1, with \(\text{Pr}\{A_t = 1\} = P_t\) and \(\text{Pr}\{A_t = 0\} = 1−P_t\) (the Bernoulli distribution). Let \(h(s, 0, \theta)\) and \(h(s, 1, \theta)\) be the preferences in state \(s\) for the unit’s two actions given policy parameter \(\theta\). Assume that the difference between the action preferences is given by a weighted sum of the unit’s input vector, that is, assume that \(h(s, 1, \theta) − h(s, 0, \theta) = \theta ^{\mathsf{T} }\textbf{x}(s)\), where \(\theta\) is the unit’s weight vector.
( a ) Show that if the exponential soft-max distribution (13.2) is used to convert action preferences to policies, then \(P_t = \pi(1|S_t, \theta_t) = 1/(1 + \exp(−\theta_t^{\mathsf{T}}\textbf{x}(S_t)))\) (the logistic function).
( b ) What is the Monte-Carlo REINFORCE update of \(\theta_t\) to \(\theta_{t+1}\) upon receipt of return \(G_t\)?
( c ) Express the eligibility \(\nabla\ln\pi(a|s, \theta)\) for a Bernoulli-logistic unit, in terms of \(a, \textbf{x}(s)\), and \(\pi(a|s,\theta)\) by calculating the gradient.
Hint for part ( c ): Define \(P = \pi(1|s, \theta)\) and compute the derivative of the logarithm, for each action, using the chain rule on \(P\). Combine the two results into one expression that depends on \(a\) and \(P\), and then use the chain rule again, this time on \(\theta ^{\mathsf{T}}\textbf{x}(s)\), noting that the derivative of the logistic function \(f(x) = 1/(1 + e^{−x})\) is \(f(x)(1 − f(x))\).
( a )
\[\begin{array}{l}
P_t=\pi(1|S_t,\theta_t)= \frac{e^{h(s,1,\theta)}}{e^{h(s,1,\theta)}+e^{h(s,0,\theta)}}= \frac{1}{1+e^{h(s,0,\theta)-h(s,1,\theta)}}=\frac{1}{1+e^{-\theta^\mathsf{T}\textbf{x}(s)}}
\end{array}
\]
( b ) The REINFORCE update is \(\theta_{t+1}\overset{.}{=}\theta_t+\alpha G_t\frac{\nabla\pi(A_t|S_t,\theta_t)}{\pi(A_t|S_t,\theta_t)}\), which applies here unchanged: if action 1 is sampled, \(\theta_{t+1}\overset{.}{=}\theta_t+\alpha G_t \frac{\nabla P_t}{P_t}\); if action 0 is sampled, \(\theta_{t+1}\overset{.}{=}\theta_t+\alpha G_t \frac{\nabla (1-P_t)}{1-P_t}\). Substituting the eligibility from part ( c ) below gives the explicit form \(\theta_{t+1}=\theta_t+\alpha G_t\,(A_t-P_t)\,\textbf{x}(S_t)\).
( c ) The only way I see is to substitute the expression from ( a ), since otherwise \(\pi\) has no functional form to differentiate. Substituting,
\[\begin{array}{l}
\nabla\ln\pi(1|s,\theta)=\nabla \ln\frac{1}{1+e^{-\theta^\mathsf{T}\textbf{x}(s)}}
\\ \\
\qquad \qquad \qquad = -\nabla \ln(1+e^{-\theta^\mathsf{T}\textbf{x}(s)})
\\ \\
\qquad \qquad \qquad = -(1+e^{-\theta^\mathsf{T}\textbf{x}(s)})^{-1}e^{-\theta^\mathsf{T}\textbf{x}(s)}(-\textbf{x}(s))
\\ \\
\qquad \qquad \qquad = \frac{e^{-\theta^\mathsf{T}\textbf{x}(s)}\textbf{x}(s)}{1+e^{-\theta^\mathsf{T}\textbf{x}(s)}}
\\ \\
\qquad \qquad \qquad = (1-\frac{1}{1+e^{-\theta^\mathsf{T}\textbf{x}(s)}})\textbf{x}(s)
\\ \\
\qquad \qquad \qquad = (1-\pi(1|s,\theta))\textbf{x}(s)
\\ \\
\nabla\ln\pi(0|s,\theta)=\nabla \ln\frac{e^{-\theta^\mathsf{T}\textbf{x}(s)}}{1+e^{-\theta^\mathsf{T}\textbf{x}(s)}}
\\ \\
\qquad \qquad \qquad =-\nabla \theta^\mathsf{T}\textbf{x}(s)-\nabla \ln (1+e^{-\theta^\mathsf{T}\textbf{x}(s)})
\\ \\
\qquad \qquad \qquad =-\textbf{x}(s)+ (1-\frac{1}{1+e^{-\theta^\mathsf{T}\textbf{x}(s)}})\textbf{x}(s)
\\ \\
\qquad \qquad \qquad =-\frac{1}{1+e^{-\theta^\mathsf{T}\textbf{x}(s)}}\textbf{x}(s)
\\ \\
\qquad \qquad \qquad =-\pi(1|s,\theta)\textbf{x}(s)
\end{array}
\]
Combining the two cases into a single expression: \(\nabla\ln\pi(a|s,\theta)=\big(a-\pi(1|s,\theta)\big)\textbf{x}(s)\) for \(a\in\{0,1\}\).
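A finite-difference check of the combined expression \((a-P)\textbf{x}(s)\) on random test data made up for illustration.
```python
import numpy as np

rng = np.random.default_rng(6)
d = 5
theta, x = rng.normal(size=d), rng.normal(size=d)

def log_pi(theta, a):
    P = 1.0 / (1.0 + np.exp(-theta @ x))
    return np.log(P if a == 1 else 1.0 - P)

eps = 1e-6
for a in (0, 1):
    P = 1.0 / (1.0 + np.exp(-theta @ x))
    analytic = (a - P) * x
    numeric = np.array([(log_pi(theta + eps * np.eye(d)[i], a)
                         - log_pi(theta - eps * np.eye(d)[i], a)) / (2 * eps)
                        for i in range(d)])
    print(a, np.max(np.abs(analytic - numeric)))   # both should be close to zero
```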