
Bellman Equation

Calculating returns

  1. Direct calculation: expand the discounted sum of rewards for each state separately.
  2. Bootstrapping: the returns of different states depend on each other (see the example below).
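As a concrete illustration (a small made-up example, not from the original notes), suppose four states form a loop s_1 \to s_2 \to s_3 \to s_4 \to s_1 and the reward for leaving s_i is r_i . The returns starting from each state then satisfy

\[g_1 = r_1 + \gamma g_2, \quad g_2 = r_2 + \gamma g_3, \quad g_3 = r_3 + \gamma g_4, \quad g_4 = r_4 + \gamma g_1, \]

a set of mutually dependent equations that can be solved jointly instead of expanding each infinite sum. The Bellman equation generalizes exactly this idea to stochastic policies and models.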

Bellman equation

  • Computes returns via bootstrapping
  • Can be written in matrix-vector form

State value

\[S_t \xrightarrow{A_t} R_{t+1},\ S_{t+1} \xrightarrow{A_{t+1}} R_{t+2},\ S_{t+2} \xrightarrow{A_{t+2}} \dots \]

\[G_t = R_{t+1}+\gamma R_{t+2} + \gamma ^2 R_{t+3} + \dots \]
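As a quick numerical check (numbers chosen only for illustration): if every reward equals 1 and \gamma = 0.9 , the geometric series gives

\[G_t = 1 + 0.9 + 0.9^2 + \dots = \frac{1}{1-0.9} = 10. \]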

\[v_\pi (s) = \mathbb{E}[G_t | S_t = s], \text{ can also be expressed as } v(s, \pi) \]

  • It is a function of s .
  • It depends on the policy \pi .

Deriving the Bellman equation

\[G_t = R_{t+1}+\gamma G_{t+1}, \\ \\ \begin{align} v_\pi (s) &= \mathbb{E}[G_t | S_t = s] \\ &=\mathbb{E}[R_{t+1} | S_t = s]+ \gamma \mathbb{E}[G_{t+1} | S_t = s] \end{align} \]

\[\begin{align} \mathbb{E}[R_{t+1} | S_t = s] &= \Sigma_a \pi (a|s) \mathbb{E}[R_{t+1} | S_t=s, A_t=a] \\ &= \Sigma_a \pi (a|s) \Sigma_r p(r|s,a)r \end{align} \]

\[\begin{align} \mathbb{E}[G_{t+1} | S_t = s] &= \Sigma_{s'}\mathbb{E}[G_{t+1}|S_{t+1}=s']p(s'|s) \\ &= \Sigma_{s'}v_\pi (s')p(s'|s) \\ &= \Sigma_{s'}v_\pi (s') \Sigma_a p(s'|s,a) \pi (a|s) \end{align} \]

The first step uses the Markov property: conditioned on S_{t+1}=s' , the future return G_{t+1} no longer depends on S_t , so \mathbb{E}[G_{t+1}|S_{t+1}=s', S_t=s]=\mathbb{E}[G_{t+1}|S_{t+1}=s']=v_\pi (s') .

Therefore, we have

\[v_\pi (s) = \Sigma_a \pi (a|s) [\Sigma_r p(r|s,a)r+\gamma \Sigma_{s'}p(s'|s,a)v_\pi(s')], \quad \forall s \in S. \]

  • The Bellman equation is really a set of n equations, one per state; conventionally only the single equation above is written.
  • The Bellman equation depends on the policy.
  • The next few chapters assume the dynamic model (environment model) is known; the case of an unknown model is studied later.

Matrix-vector form of the Bellman equation

Rewrite the Bellman equation as

\[v_\pi (s) = r_\pi (s) + \gamma \Sigma_{s'} p_\pi (s'|s)v_\pi (s'), \]

where r_\pi (s) \triangleq \Sigma_a \pi (a|s) \Sigma_r p(r|s,a)r is the expected one-step reward under \pi , and p_\pi (s'|s) \triangleq \Sigma_a \pi (a|s) p(s'|s,a) is the state transition probability under \pi .

Adding explicit state indices (the matrix form below is illustrated for four states):

\[v_\pi (s_i) = r_\pi (s_i) + \gamma \Sigma_{s_j} p_\pi (s_j|s_i)v_\pi (s_j) \]

\[\underbrace{ \left [ \begin{matrix} v_\pi (s_1) \\ v_\pi (s_2) \\ v_\pi (s_3) \\ v_\pi (s_4) \end{matrix} \right ] }_{v_\pi} = \underbrace{ \left [ \begin{matrix} r_\pi (s_1) \\ r_\pi (s_2) \\ r_\pi (s_3) \\ r_\pi (s_4) \end{matrix} \right ] }_{r_\pi} + \gamma \underbrace{ \left [ \begin{matrix} p_\pi (s_1|s_1) & p_\pi (s_2|s_1) & p_\pi (s_3|s_1) & p_\pi (s_4|s_1) \\ p_\pi (s_1|s_2) & p_\pi (s_2|s_2) & p_\pi (s_3|s_2) & p_\pi (s_4|s_2) \\ p_\pi (s_1|s_3) & p_\pi (s_2|s_3) & p_\pi (s_3|s_3) & p_\pi (s_4|s_3) \\ p_\pi (s_1|s_4) & p_\pi (s_2|s_4) & p_\pi (s_3|s_4) & p_\pi (s_4|s_4) \end{matrix} \right ] }_{P_\pi} \underbrace{ \left [ \begin{matrix} v_\pi (s_1) \\ v_\pi (s_2) \\ v_\pi (s_3) \\ v_\pi (s_4) \end{matrix} \right ] }_{v_\pi} \]
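A minimal sketch (numpy-based; the hypothetical model below, with its state/action counts and random probabilities, is made up for illustration) of how r_\pi and P_\pi can be assembled from \pi (a|s) , p(s'|s,a) , and p(r|s,a) :

```python
import numpy as np

# Hypothetical model: n states, m actions (values are illustrative).
n, m = 4, 2
rng = np.random.default_rng(0)

# pi[s, a] = pi(a|s): a random policy, each row sums to 1.
pi = rng.random((n, m))
pi /= pi.sum(axis=1, keepdims=True)

# P[a, s, s2] = p(s2|s, a): random transition model, rows sum to 1.
P = rng.random((m, n, n))
P /= P.sum(axis=2, keepdims=True)

# R[s, a] = sum_r p(r|s, a) * r: expected immediate reward for (s, a).
R = rng.random((n, m))

# r_pi(s) = sum_a pi(a|s) * R[s, a]
r_pi = (pi * R).sum(axis=1)

# P_pi(s2|s) = sum_a pi(a|s) * p(s2|s, a)
P_pi = np.einsum('sa,asn->sn', pi, P)

assert np.allclose(P_pi.sum(axis=1), 1.0)  # each row of P_pi is a distribution
```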

Closed-form solution

\[v_\pi = (I-\gamma P_\pi)^{-1} r_\pi \]
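Why the inverse exists (a standard linear-algebra remark, added here for completeness): P_\pi is row-stochastic, so its spectral radius is at most 1; for \gamma < 1 every eigenvalue of \gamma P_\pi therefore has modulus below 1, and I-\gamma P_\pi is invertible with the Neumann series

\[(I-\gamma P_\pi)^{-1} = I + \gamma P_\pi + \gamma ^2 P_\pi ^2 + \dots \]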

Iterative solution

\[v_{k+1}=r_\pi +\gamma P_\pi v_k, \\ v_k \to v_\pi = (I-\gamma P_\pi)^{-1} r_\pi, \quad k \to \infty \]

Proof sketch: define the error \delta_k = v_k - v_\pi ; then \delta_{k+1} = \gamma P_\pi \delta_k , so the error shrinks by a factor of \gamma at every step and converges to zero. A numerical check is sketched below.
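A minimal numerical check (numpy; the 4-state P_\pi and r_\pi below are made-up example data, not from these notes) that the iteration converges to the closed-form solution:

```python
import numpy as np

gamma = 0.9
# Made-up 4-state example: r_pi and a row-stochastic P_pi.
r_pi = np.array([1.0, 0.0, -1.0, 0.5])
P_pi = np.array([
    [0.10, 0.60, 0.20, 0.10],
    [0.30, 0.30, 0.20, 0.20],
    [0.25, 0.25, 0.25, 0.25],
    [0.00, 0.50, 0.50, 0.00],
])

# Closed-form solution: v = (I - gamma * P_pi)^{-1} r_pi
v_closed = np.linalg.solve(np.eye(4) - gamma * P_pi, r_pi)

# Iterative solution: v_{k+1} = r_pi + gamma * P_pi @ v_k
v = np.zeros(4)
for _ in range(1000):
    v = r_pi + gamma * P_pi @ v

print(np.max(np.abs(v - v_closed)))  # should be ~0 (e.g. < 1e-12)
```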

Action value

\[q_\pi (s,a)=\mathbb{E}[G_t|S_t=s,A_t=a] \]

The expected return obtained after taking action a in state s .

  • It is a function of the state-action pair (s, a) .
  • It depends on \pi .

Relation to state value

\[\underbrace{\mathbb{E}[G_t|S_t=s]}_{v_\pi (s)} = \Sigma_a \underbrace{\mathbb{E}[G_t|S_t = s, A_t = a]}_{q_\pi (s,a)} \pi (a|s) \]

Comparing the state value v_\pi and the action value q_\pi , we have

\[q_\pi (s,a)=\Sigma_r p(r|s,a)r+\gamma \Sigma_{s'}p(s'|s,a)v_\pi(s') \]
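As a one-line consistency check (added here): substituting this expression into the relation above recovers exactly the Bellman equation derived earlier,

\[v_\pi (s) = \Sigma_a \pi (a|s) q_\pi (s,a) = \Sigma_a \pi (a|s) [\Sigma_r p(r|s,a)r+\gamma \Sigma_{s'}p(s'|s,a)v_\pi(s')], \]

so the state value and action value are mutually consistent.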
