Calculating return
- Direct calculation (from the definition)
- Bootstrapping (returns rely on each other)
Bellman equation
- Describes the returns (state values) in bootstrapping form
- Usually written and solved in matrix-vector form
State value
\[S_t \xrightarrow{A_t} R_{t+1}, S_{t+1} \xrightarrow{A_{t+1}} R_{t+2}, S_{t+2} \xrightarrow{A_{t+2}} \dots\]
\[G_t = R_{t+1}+\gamma R_{t+2} + \gamma ^2 R_{t+3} + \dots\]
\[v_\pi (s) = \mathbb{E}[G_t | S_t = s], \quad \text{also written } v(s, \pi)\]
- It is a function of s.
- It depends on the policy \pi.
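The two ways of calculating a return listed at the top can be sketched for a single sampled trajectory. This is only an illustration: it assumes a finite episode (the definition above is an infinite sum), and the names `discounted_return` and `returns_by_bootstrapping` as well as the reward values are made up for the example.

```python
def discounted_return(rewards, gamma):
    """Direct calculation: G_0 = R_1 + gamma * R_2 + gamma^2 * R_3 + ..."""
    return sum(gamma**k * r for k, r in enumerate(rewards))


def returns_by_bootstrapping(rewards, gamma):
    """Backward recursion: each return is the next reward plus gamma times the next return.

    Assumes the return after the final step of the episode is zero.
    """
    returns = [0.0] * len(rewards)
    g_next = 0.0
    for t in reversed(range(len(rewards))):
        g_next = rewards[t] + gamma * g_next
        returns[t] = g_next
    return returns


# Hypothetical episode: rewards[t] holds R_{t+1}.
rewards = [0.0, 1.0, 1.0]
gamma = 0.5

g0 = discounted_return(rewards, gamma)
all_returns = returns_by_bootstrapping(rewards, gamma)
print(g0, all_returns)
```

Both functions agree on the return from the first step; the bootstrapped version also yields the return from every later step in a single backward pass.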
Deriving the Bellman equation
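\[G_t = R_{t+1}+\gamma G_{t+1}\]
\[\begin{aligned}
v_\pi (s) &= \mathbb{E}[G_t | S_t = s] \\
&= \mathbb{E}[R_{t+1} + \gamma G_{t+1} | S_t = s] \\
&= \mathbb{E}[R_{t+1} | S_t = s]+ \gamma \mathbb{E}[G_{t+1} | S_t = s]
\end{aligned}\]
The first term is the expected immediate reward:
\[\begin{aligned}
\mathbb{E}[R_{t+1} | S_t = s] &= \sum_a \pi (a|s) \mathbb{E}[R_{t+1} | S_t=s, A_t=a] \\
&= \sum_a \pi (a|s) \sum_r p(r|s,a)r
\end{aligned}\]
The second term is the expected future return. By the Markov property, G_{t+1} depends on S_t only through S_{t+1}:
\[\begin{aligned}
\mathbb{E}[G_{t+1} | S_t = s] &= \sum_{s'}\mathbb{E}[G_{t+1}|S_t=s, S_{t+1}=s']p(s'|s) \\
&= \sum_{s'}\mathbb{E}[G_{t+1}|S_{t+1}=s']p(s'|s) \\
&= \sum_{s'}v_\pi (s')p(s'|s) \\
&= \sum_{s'}v_\pi (s') \sum_a p(s'|s,a) \pi (a|s)
\end{aligned}\]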
Therefore, we have
\[v_\pi (s) = \sum_a \pi (a|s) \left[\sum_r p(r|s,a)r+\gamma \sum_{s'}p(s'|s,a)v_\pi(s')\right], \quad \forall s \in \mathcal{S}.\]
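- The Bellman equation is really a system of n equations, one per state, that are solved together; usually only the single generic equation above is written.
- The Bellman equation depends on the policy.
- The next several chapters assume the dynamic model (environment model) is known; the case of an unknown model is covered later.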
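To make the equation concrete, the sketch below evaluates its right-hand side for every state of a small, made-up MDP with two states and two actions. The array layout (`policy[s, a]`, `reward_probs[s, a, i]`, `trans_probs[s, a, s']`), the function name `bellman_rhs`, and all the numbers are assumptions chosen for illustration, and the model is taken as known.

```python
import numpy as np


def bellman_rhs(v, policy, reward_probs, rewards, trans_probs, gamma):
    """Evaluate the right-hand side of the Bellman equation for every state.

    policy[s, a]          = pi(a|s)
    reward_probs[s, a, i] = p(r_i|s, a), with r_i = rewards[i]
    trans_probs[s, a, s2] = p(s2|s, a)
    """
    n_states, n_actions = policy.shape
    out = np.zeros(n_states)
    for s in range(n_states):
        for a in range(n_actions):
            expected_reward = reward_probs[s, a] @ rewards   # sum_r p(r|s,a) r
            expected_next_value = trans_probs[s, a] @ v      # sum_s' p(s'|s,a) v(s')
            out[s] += policy[s, a] * (expected_reward + gamma * expected_next_value)
    return out


# Hypothetical model: possible rewards are 0 and 1.
rewards = np.array([0.0, 1.0])
reward_probs = np.array([
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 gives r=0, action 1 gives r=1
    [[0.0, 1.0], [0.5, 0.5]],  # state 1: action 0 gives r=1, action 1 gives r=0 or r=1
])
trans_probs = np.array([
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays, action 1 moves to state 1
    [[0.0, 1.0], [1.0, 0.0]],  # state 1: action 0 stays, action 1 moves to state 0
])
policy = np.array([[0.5, 0.5], [1.0, 0.0]])
gamma = 0.9

# Candidate state values, worked out by hand for this model and policy.
v = np.array([5.0 / 0.55, 10.0])
rhs = bellman_rhs(v, policy, reward_probs, rewards, trans_probs, gamma)
print(rhs)
```

Because `v` here is the true state value of this policy, the right-hand side reproduces `v`; for any other vector the two sides would differ.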
Matrix-vector form of the Bellman equation
Rewrite the Bellman equation as
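\[v_\pi (s) = r_\pi (s) + \gamma \sum_{s'} p_\pi (s'|s)v_\pi (s')\]
where
\[r_\pi (s) = \sum_a \pi (a|s) \sum_r p(r|s,a)r, \quad p_\pi (s'|s) = \sum_a \pi (a|s) p(s'|s,a)\]
Index the states as s_1, ..., s_n:
\[v_\pi (s_i) = r_\pi (s_i) + \gamma \sum_{s_j} p_\pi (s_j|s_i)v_\pi (s_j)\]
With n = 4 states, stacking these equations gives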
\[
\left[\begin{array}{c}
v_\pi (s_1) \\
v_\pi (s_2) \\
v_\pi (s_3) \\
v_\pi (s_4)
\end{array}\right]
=
\left[\begin{array}{c}
r_\pi (s_1) \\
r_\pi (s_2) \\
r_\pi (s_3) \\
r_\pi (s_4)
\end{array}\right]
+ \gamma
\left[\begin{array}{cccc}
p_\pi (s_1|s_1) & p_\pi (s_2|s_1) & p_\pi (s_3|s_1) & p_\pi (s_4|s_1) \\
p_\pi (s_1|s_2) & p_\pi (s_2|s_2) & p_\pi (s_3|s_2) & p_\pi (s_4|s_2) \\
p_\pi (s_1|s_3) & p_\pi (s_2|s_3) & p_\pi (s_3|s_3) & p_\pi (s_4|s_3) \\
p_\pi (s_1|s_4) & p_\pi (s_2|s_4) & p_\pi (s_3|s_4) & p_\pi (s_4|s_4)
\end{array}\right]
\left[\begin{array}{c}
v_\pi (s_1) \\
v_\pi (s_2) \\
v_\pi (s_3) \\
v_\pi (s_4)
\end{array}\right]
\]
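As a rough sketch of how the reward vector and the transition matrix of the matrix-vector form could be assembled in NumPy, the block below averages a made-up two-state, two-action model over a policy. The arrays `expected_reward[s, a]` (standing for the sum over r of p(r|s,a) r) and `trans_probs[s, a, s']`, and all numbers, are illustrative assumptions.

```python
import numpy as np

# Hypothetical model with two states and two actions.
# expected_reward[s, a] = sum_r p(r|s,a) * r
expected_reward = np.array([[0.0, 1.0],
                            [1.0, 0.5]])
# trans_probs[s, a, s2] = p(s2|s, a)
trans_probs = np.array([
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
])
# policy[s, a] = pi(a|s)
policy = np.array([[0.5, 0.5],
                   [1.0, 0.0]])

# r_pi[s] = sum_a pi(a|s) * expected_reward[s, a]
r_pi = (policy * expected_reward).sum(axis=1)

# P_pi[s, s2] = sum_a pi(a|s) * p(s2|s, a); row i holds p_pi(. | s_i)
P_pi = np.einsum("sa,sat->st", policy, trans_probs)

print(r_pi)
print(P_pi)
```

Each row of `P_pi` is a probability distribution over next states, so the rows sum to one.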
Closed-form solution
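In compact form the system reads v_\pi = r_\pi + \gamma P_\pi v_\pi, where [P_\pi]_{ij} = p_\pi (s_j|s_i). Solving for v_\pi gives
\[v_\pi = (I-\gamma P_\pi)^{-1} r_\pi\]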
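A minimal NumPy sketch of the closed-form solution follows. The values of `r_pi`, `P_pi` and `gamma` are invented for a two-state example, and `np.linalg.solve` is used instead of forming the inverse explicitly, which is a common numerical choice rather than something the formula requires.

```python
import numpy as np

# Hypothetical policy-averaged reward vector and transition matrix.
r_pi = np.array([0.5, 1.0])
P_pi = np.array([[0.5, 0.5],
                 [0.0, 1.0]])
gamma = 0.9

# Solve (I - gamma * P_pi) v = r_pi for v.
identity = np.eye(len(r_pi))
v_pi = np.linalg.solve(identity - gamma * P_pi, r_pi)
print(v_pi)

# Residual of v = r_pi + gamma * P_pi v; should be close to zero.
residual = v_pi - (r_pi + gamma * P_pi @ v_pi)
print(residual)
```

For large state spaces the linear solve becomes expensive, which is one motivation for the iterative alternative.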
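Iterative solution
\[v_{k+1}=r_\pi +\gamma P_\pi v_k, \quad k = 0, 1, 2, \dots\]
\[v_k \to v_\pi = (I-\gamma P_\pi)^{-1} r_\pi \quad \text{as } k \to \infty\]
Proof sketch: define the error \delta_k = v_k - v_\pi. Subtracting v_\pi = r_\pi + \gamma P_\pi v_\pi from the update gives \delta_{k+1} = \gamma P_\pi \delta_k, so \delta_k = \gamma^k P_\pi^k \delta_0 \to 0, because \gamma < 1 and every entry of P_\pi^k lies in [0, 1].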
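The iteration can be sketched as a short loop. The function name `iterative_policy_evaluation`, the stopping tolerance, the zero initial guess and the two-state numbers are all assumptions made for this example; any initial vector would do.

```python
import numpy as np


def iterative_policy_evaluation(r_pi, P_pi, gamma, tol=1e-10, max_iters=10_000):
    """Repeat v <- r_pi + gamma * P_pi v until successive iterates differ by less than tol."""
    v = np.zeros_like(r_pi)  # arbitrary starting point v_0
    for k in range(max_iters):
        v_next = r_pi + gamma * (P_pi @ v)
        if np.max(np.abs(v_next - v)) < tol:
            return v_next, k + 1
        v = v_next
    return v, max_iters


# Hypothetical policy-averaged reward vector and transition matrix.
r_pi = np.array([0.5, 1.0])
P_pi = np.array([[0.5, 0.5],
                 [0.0, 1.0]])
gamma = 0.9

v_iter, n_iters = iterative_policy_evaluation(r_pi, P_pi, gamma)
v_closed = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)
print(v_iter, n_iters)
print(v_closed)
```

The iterates approach the closed-form solution at a rate governed by the discount factor, so values of gamma closer to 1 need more sweeps.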
Action value
\[q_\pi (s,a)=\mathbb{E}[G_t|S_t=s,A_t=a]\]
- a function of the state-action pair (s, a)
- depends on \pi
Relation to state value
\[\underbrace{\mathbb{E}[G_t|S_t=s]}_{v_\pi (s)} = \sum_a \underbrace{\mathbb{E}[G_t|S_t = s, A_t = a]}_{q_\pi (s,a)} \pi (a|s)\]
Comparing this with the Bellman equation for v_\pi (s), the bracketed term there is the action value:
\[q_\pi (s,a)=\sum_r p(r|s,a)r+\gamma \sum_{s'}p(s'|s,a)v_\pi(s')\]
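The two relations between state values and action values can be checked numerically. The sketch below reuses a made-up two-state, two-action model; `expected_reward[s, a]` stands for the sum over r of p(r|s,a) r, and the state values are obtained from the closed-form solution. All names and numbers are illustrative assumptions.

```python
import numpy as np

# Hypothetical model and policy.
expected_reward = np.array([[0.0, 1.0],
                            [1.0, 0.5]])      # sum_r p(r|s,a) * r
trans_probs = np.array([
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
])                                            # p(s2|s, a)
policy = np.array([[0.5, 0.5],
                   [1.0, 0.0]])               # pi(a|s)
gamma = 0.9

# State values from the closed-form solution of the Bellman equation.
r_pi = (policy * expected_reward).sum(axis=1)
P_pi = np.einsum("sa,sat->st", policy, trans_probs)
v_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Action values: q(s, a) = expected reward + gamma * sum_s' p(s'|s,a) v(s')
q_pi = expected_reward + gamma * (trans_probs @ v_pi)

# Averaging q over the policy should recover v.
v_from_q = (policy * q_pi).sum(axis=1)
print(q_pi)
print(v_from_q, v_pi)
```

Note that `q_pi` is defined even for actions the policy never selects, such as the second action in the second state here.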