【RL】CH2-Bellman equation
The discounted return
\[\begin{aligned}
G_t & =R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\ldots \\
& =R_{t+1}+\gamma\left(R_{t+2}+\gamma R_{t+3}+\ldots\right) \\
& =R_{t+1}+\gamma G_{t+1}
\end{aligned}
\]
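The recursion \(G_t=R_{t+1}+\gamma G_{t+1}\) can be checked numerically. The following is a minimal sketch, assuming a finite reward sequence with all later rewards equal to zero; the function names are hypothetical and chosen only for illustration.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_{T-1}] for a finite reward sequence.

    rewards[t] plays the role of R_{t+1}. Rewards after the end of the
    list are assumed to be zero (a finite-episode assumption).
    """
    returns = [0.0] * len(rewards)
    g_next = 0.0  # G_T = 0 beyond the end of the sequence
    for t in reversed(range(len(rewards))):
        # Recursive form: G_t = R_{t+1} + gamma * G_{t+1}
        g_next = rewards[t] + gamma * g_next
        returns[t] = g_next
    return returns


def discounted_return_direct(rewards, gamma):
    """G_0 computed from the defining sum, for comparison."""
    return sum(gamma**k * r for k, r in enumerate(rewards))


print(discounted_returns([1.0, 0.0, 2.0], 0.5))        # [1.5, 1.0, 2.0]
print(discounted_return_direct([1.0, 0.0, 2.0], 0.5))  # 1.5
```

Working backward from the last reward uses each \(G_{t+1}\) exactly once, so all returns of the sequence are obtained in a single pass.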
State-value function (the state value of \(s\)): \(v_\pi(s)\)
\[\begin{aligned}
v_\pi(s) & =\mathbb{E}\left[G_t \mid S_t=s\right] \\
& =\mathbb{E}\left[R_{t+1}+\gamma G_{t+1} \mid S_t=s\right] \\
& =\mathbb{E}\left[R_{t+1} \mid S_t=s\right]+\gamma \mathbb{E}\left[G_{t+1} \mid S_t=s\right]
\end{aligned}
\]
Bellman Equation
\[\begin{aligned}
v_\pi(s) & =\mathbb{E}\left[R_{t+1} \mid S_t=s\right]+\gamma \mathbb{E}\left[G_{t+1} \mid S_t=s\right] \\
& =\underbrace{\sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{r \in \mathcal{R}} p(r \mid s, a) r}_{\text {mean of immediate rewards }}+\underbrace{\gamma \sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s^{\prime} \in \mathcal{S}} p\left(s^{\prime} \mid s, a\right) v_\pi\left(s^{\prime}\right)}_{\text {mean of future rewards }} \\
& =\sum_{a \in \mathcal{A}} \pi(a \mid s)\left[\sum_{r \in \mathcal{R}} p(r \mid s, a) r+\gamma \sum_{s^{\prime} \in \mathcal{S}} p\left(s^{\prime} \mid s, a\right) v_\pi\left(s^{\prime}\right)\right], \quad \text { for all } s \in \mathcal{S} .
\end{aligned}
\]
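To make the Bellman equation concrete, the sketch below applies its right-hand side repeatedly to a value estimate until the estimate stops changing. This fixed-point iteration is one common way to solve the equation numerically and is not something this section itself describes. The two-state, two-action model, the policy, and all names (`pi`, `p_next`, `p_reward`, `bellman_backup`, `evaluate_policy`) are hypothetical and serve only as an illustration.

```python
# Hypothetical two-state, two-action model, used only for illustration.
STATES = [0, 1]
ACTIONS = [0, 1]
GAMMA = 0.9

# pi[s][a] = pi(a | s)
pi = [[0.5, 0.5], [0.0, 1.0]]
# p_next[s][a][s2] = p(s' | s, a): action 0 leads to state 0, action 1 to state 1
p_next = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
# p_reward[s][a] = {r: p(r | s, a)}
p_reward = [
    [{0.0: 1.0}, {1.0: 1.0}],
    [{-1.0: 1.0}, {1.0: 1.0}],
]


def bellman_backup(v):
    """Evaluate the right-hand side of the Bellman equation for every state."""
    new_v = []
    for s in STATES:
        total = 0.0
        for a in ACTIONS:
            # Mean of immediate rewards: sum_r p(r | s, a) * r
            mean_r = sum(prob * r for r, prob in p_reward[s][a].items())
            # Mean of future rewards: sum_s' p(s' | s, a) * v(s')
            mean_next = sum(p_next[s][a][s2] * v[s2] for s2 in STATES)
            total += pi[s][a] * (mean_r + GAMMA * mean_next)
        new_v.append(total)
    return new_v


def evaluate_policy(tol=1e-10, max_iters=10_000):
    """Iterate v <- backup(v) from zero until successive estimates agree."""
    v = [0.0 for _ in STATES]
    for _ in range(max_iters):
        new_v = bellman_backup(v)
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v
        v = new_v
    return v


v_pi = evaluate_policy()
print(v_pi)
```

For this particular model the equation can also be solved by hand: state 1 always receives reward 1 and stays in place, so \(v_\pi(1)=1+0.9\,v_\pi(1)=10\), and \(v_\pi(0)=0.45\,v_\pi(0)+5\approx 9.09\). The iteration should return values close to these.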
Two equivalent expressions
First
First, it follows from the law of total probability that
\[\begin{aligned}
& p\left(s^{\prime} \mid s, a\right)=\sum_{r \in \mathcal{R}} p\left(s^{\prime}, r \mid s, a\right), \\
& p(r \mid s, a)=\sum_{s^{\prime} \in \mathcal{S}} p\left(s^{\prime}, r \mid s, a\right) .
\end{aligned}
\]
Then the Bellman equation above, referred to as equation (2.7), can be rewritten as
\[v_\pi(s)=\sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s^{\prime} \in \mathcal{S}} \sum_{r \in \mathcal{R}} p\left(s^{\prime}, r \mid s, a\right)\left[r+\gamma v_\pi\left(s^{\prime}\right)\right]
\]
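The equivalence of the two forms can be checked numerically. The sketch below starts from a joint distribution \(p(s', r \mid s, a)\), derives both marginals by the law of total probability, and evaluates the right-hand side in each form. All numbers and names (`joint`, `pi`, `v`, `rhs_marginal`, `rhs_joint`) are hypothetical, and `v` is an arbitrary placeholder rather than a solution of the Bellman equation, since the two right-hand sides agree for any \(v\).

```python
GAMMA = 0.9

# joint[(s, a)] = {(s_next, r): p(s', r | s, a)}; hypothetical numbers
joint = {
    (0, 0): {(0, 0.0): 0.7, (1, 1.0): 0.3},
    (0, 1): {(1, 1.0): 0.6, (1, 2.0): 0.4},
    (1, 0): {(0, -1.0): 1.0},
    (1, 1): {(0, 0.0): 0.5, (1, 1.0): 0.5},
}
# pi[s][a] = pi(a | s)
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.2, 1: 0.8}}
# Arbitrary value estimates, not a solution of the Bellman equation
v = {0: 3.0, 1: -2.0}


def marginals(dist):
    """Law of total probability: sum the joint over r, and over s'."""
    p_s, p_r = {}, {}
    for (s2, r), p in dist.items():
        p_s[s2] = p_s.get(s2, 0.0) + p  # p(s' | s, a)
        p_r[r] = p_r.get(r, 0.0) + p    # p(r | s, a)
    return p_s, p_r


def rhs_marginal(s):
    """Right-hand side written with p(r | s, a) and p(s' | s, a)."""
    total = 0.0
    for a, pa in pi[s].items():
        p_s, p_r = marginals(joint[(s, a)])
        mean_r = sum(p * r for r, p in p_r.items())
        mean_v = sum(p * v[s2] for s2, p in p_s.items())
        total += pa * (mean_r + GAMMA * mean_v)
    return total


def rhs_joint(s):
    """Right-hand side written with the joint p(s', r | s, a)."""
    return sum(
        pa * p * (r + GAMMA * v[s2])
        for a, pa in pi[s].items()
        for (s2, r), p in joint[(s, a)].items()
    )


for s in (0, 1):
    print(s, rhs_marginal(s), rhs_joint(s))
```

Up to floating-point rounding, the two printed values agree for each state.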
Second
Second, in some problems the reward \(r\) depends solely on the next state \(s^{\prime}\). We can then write the reward as \(r\left(s^{\prime}\right)\), so that \(p\left(r\left(s^{\prime}\right) \mid s, a\right)=p\left(s^{\prime} \mid s, a\right)\). Substituting this into the Bellman equation \((2.7)\) gives
\[v_\pi(s)=\sum_{a \in \mathcal{A}} \pi(a \mid s) \sum_{s^{\prime} \in \mathcal{S}} p\left(s^{\prime} \mid s, a\right)\left[r\left(s^{\prime}\right)+\gamma v_\pi\left(s^{\prime}\right)\right]
\]
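As a small worked illustration of this form, consider a hypothetical two-state chain that is not taken from the source: there is a single action \(a\), so \(\pi(a \mid s)=1\); from \(s_1\) the next state is always \(s_2\), and \(s_2\) always transitions to itself; the rewards are assumed to be \(r(s_1)=0\) and \(r(s_2)=1\). The equation then reduces to
\[\begin{aligned}
v_\pi(s_2) & =r(s_2)+\gamma v_\pi(s_2) \quad \Longrightarrow \quad v_\pi(s_2)=\frac{1}{1-\gamma}, \\
v_\pi(s_1) & =r(s_2)+\gamma v_\pi(s_2)=1+\frac{\gamma}{1-\gamma}=\frac{1}{1-\gamma} .
\end{aligned}
\]
With \(\gamma=0.9\), both state values equal \(10\). The two states share the same value here because both transition to \(s_2\) and the reward depends only on the next state.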