RL Basics | A Convergence Proof for Policy Iteration


(This is actually an assignment from a course 🤣. I figure it might come up in algorithm-engineer interviews, so I'm archiving it here.)

Related post: RL 基础 | Value Iteration 的收敛性证明


Problem: prove the convergence of Policy Iteration

Please prove that the Policy Iteration algorithm terminates within finitely many steps under the discrete-state, discrete-action, discounted-reward setting.

Answer:

0 Background

First of all, let's review what Policy Iteration is. It consists of two steps:

  • 1 - Policy Evaluation:
    • For an initial policy $\pi_1$ and an initial value function $V_0$, we use the Bellman operator $B^{\pi_1}V(s) = \mathbb{E}_{a\sim\pi_1(a|s)}\big[r(s,a) + \gamma\,\mathbb{E}_{s'\sim p(s'|s,a)}V(s')\big]$ to obtain the exact value function $V_1$ of policy $\pi_1$.
    • In practice, we repeatedly apply the Bellman operator $B^{\pi_1}$ to update the value function from its initial value $V_0$ until it satisfies $B^{\pi_1}V(s) = V(s)$ for all $s$. We denote the $V$ satisfying this equation by $V_1$, so $V_1$ is the value function of policy $\pi_1$.
  • 2 - Policy Improvement:
    • Using the value function $V_1$, we can obtain a policy $\pi_2$ that is better than the previous policy $\pi_1$: for every state $s$, $\pi_2(s) = \arg\max_a\big[r(s,a) + \gamma\,\mathbb{E}_{s'\sim p(s'|s,a)}V_1(s')\big]$.
  • Iteration:
    • After obtaining $\pi_2$, we compute its value function $V_2$ with the Bellman operator $B^{\pi_2}$; then we get a better policy $\pi_3$ from $V_2$, and so on. Eventually the iterated policy $\pi_{k+1}$ is identical to its predecessor $\pi_k$, and at that point $\pi_k$ is the optimal policy.

To recap the definition of Policy Iteration:

  • It consists of two parts: 1. Policy Evaluation, 2. Policy Improvement.
  • The first step solves for the value function of the given policy; the second step acts greedily with respect to that value function, $a = \arg\max_a\big[r(s,a) + \gamma\,\mathbb{E}_{s'}V(s')\big]$, to obtain a better new policy.
  • Iterating in this way keeps producing better policies; once an iteration returns a policy identical to the previous one, the policy has converged to the optimal policy. A runnable sketch of the whole loop is given below.
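To make the loop concrete, here is a minimal sketch of tabular Policy Iteration in Python/NumPy. The MDP (sizes `S`, `A`, the tables `P`, `R`, the discount `gamma`, the tolerance) is a randomly generated illustrative example, not something taken from this post.

```python
# Minimal tabular Policy Iteration on a small random MDP (illustrative assumptions).
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9                                    # assumed small, finite MDP
P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)   # transition probs p(s'|s,a)
R = rng.random((S, A))                                     # rewards r(s,a)

def policy_evaluation(pi, V, tol=1e-8):
    """Repeatedly apply the Bellman operator B^pi until V is (numerically) its fixed point."""
    while True:
        # (B^pi V)(s) = r(s, pi(s)) + gamma * E_{s' ~ p(.|s, pi(s))}[V(s')]
        V_new = R[np.arange(S), pi] + gamma * P[np.arange(S), pi] @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_improvement(V):
    """Greedy policy: pi(s) = argmax_a [ r(s,a) + gamma * E_{s'}[V(s')] ]."""
    return (R + gamma * P @ V).argmax(axis=1)

pi, V = np.zeros(S, dtype=int), np.zeros(S)
for k in range(10_000):
    V = policy_evaluation(pi, V)            # Policy Evaluation
    pi_new = policy_improvement(V)          # Policy Improvement
    if np.array_equal(pi_new, pi):          # policy unchanged -> optimal, terminate
        print(f"converged after {k + 1} iterations")
        break
    pi = pi_new
```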

In the following sections, we establish two key points behind the convergence guarantee of Policy Iteration:

  • First, during Policy Evaluation, the value function converges to the value function of the given policy.
  • Second, finitely many Policy Improvement steps suffice to reach the optimal policy.

1 Policy Evaluation converges to the value function of the given policy

What we want to prove is that, through Policy Evaluation, we can always obtain the value function $V_i$ corresponding to a given policy $\pi_i$. To prove this, we have to show that the Bellman operator $B^\pi$ is a contraction mapping. The proof is as follows (similar to homework 4):

$$
\begin{aligned}
\big|B^\pi V_1(s) - B^\pi V_2(s)\big|
&= \Big|\,\mathbb{E}_{a\sim\pi(a|s)}\big[r(s,a) + \gamma\,\mathbb{E}_{s'\sim p(s'|s,a)}[V_1(s')] - r(s,a) - \gamma\,\mathbb{E}_{s'\sim p(s'|s,a)}[V_2(s')]\big]\Big| \\
&= \gamma\,\Big|\,\mathbb{E}_{a\sim\pi(a|s)}\,\mathbb{E}_{s'\sim p(s'|s,a)}\big[V_1(s') - V_2(s')\big]\Big| \\
&\le \gamma\,\max_{s'}\big|V_1(s') - V_2(s')\big| \\
&= \gamma\,\|V_1 - V_2\|_\infty
\end{aligned}
\tag{1}
$$

Since the Bellman operator $B^\pi$ is a contraction mapping, the Banach fixed-point theorem gives the convergence guarantee. The argument is the same as in homework 4, so we won't go into detail here.
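For completeness, a one-line sketch of how the contraction property yields convergence: writing $V^\pi$ for the fixed point of $B^\pi$ (the true value function of $\pi$) and $V_{k+1} = B^\pi V_k$ for the Policy Evaluation iterates,

$$\|V_{k+1} - V^\pi\|_\infty = \|B^\pi V_k - B^\pi V^\pi\|_\infty \le \gamma\,\|V_k - V^\pi\|_\infty \;\Longrightarrow\; \|V_k - V^\pi\|_\infty \le \gamma^k\,\|V_0 - V^\pi\|_\infty \xrightarrow{\,k\to\infty\,} 0.$$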

In short: we want to show that Policy Evaluation really recovers the value function of the given policy. We first show that the Bellman operator $B^\pi$ is a contraction mapping (a straightforward chain of bounds, as in (1)), and then the Banach fixed-point theorem gives the convergence guarantee for Policy Evaluation (see the previous assignment). A quick numerical sanity check of the contraction property is sketched below.
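The check below (not a proof) draws random value-function pairs on a randomly generated MDP with a random stochastic policy; the MDP, the policy `pi`, and all sizes are illustrative assumptions.

```python
# Numerical check that ||B^pi V1 - B^pi V2||_inf <= gamma * ||V1 - V2||_inf
# on a random MDP with a random stochastic policy (illustrative assumptions).
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 6, 4, 0.95
P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)   # p(s'|s,a)
R = rng.random((S, A))                                     # r(s,a)
pi = rng.random((S, A)); pi /= pi.sum(-1, keepdims=True)   # pi(a|s)

def bellman(V):
    # (B^pi V)(s) = E_{a~pi(a|s)}[ r(s,a) + gamma * E_{s'~p(s'|s,a)}[V(s')] ]
    return (pi * (R + gamma * P @ V)).sum(axis=1)

for _ in range(100):
    V1, V2 = rng.normal(size=S), rng.normal(size=S)
    assert np.max(np.abs(bellman(V1) - bellman(V2))) <= gamma * np.max(np.abs(V1 - V2)) + 1e-12
print("contraction inequality held for all sampled pairs")
```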

2 Policy Improvement will converge to the optimal policy

Next, we prove the effectiveness of Policy Improvement. The proof has two parts: 1. each Policy Improvement step yields a better (or at least no worse) policy than the previous one; 2. the iteration converges to the optimal policy after finitely many steps.

2.1 Policy Improvement makes the policy better (or no worse)

We want to show that Policy Improvement, i.e. taking $a = \arg\max_a Q(s,a)$, always yields a better policy (or at least one that is no worse); in other words, the value function $V_{i+1}$ of the new policy $\pi_{i+1}$ satisfies $V_{i+1}(s) \ge V_i(s)$ for every state $s$, where $V_i$ is the value function of the old policy $\pi_i$. To show this, we unfold the Policy Evaluation process that computes $V_{i+1}$ by repeatedly applying the Bellman operator $B^{\pi_{i+1}}$, starting from the previous value function $V_i$.

Consider the following policy: we act with the improved policy $\pi_{i+1}$ for the current step only, and then follow the old policy $\pi_i$ for the rest of the episode. Its value function is:

$$V^{(1)}_{i+1}(s) = \max_a\big[r(s,a) + \gamma\,\mathbb{E}_{s'\sim p(s'|s,a)}V_i(s')\big] \ge V_i(s). \tag{2}$$

The inequality holds because $V_i(s) = \mathbb{E}_{a\sim\pi_i(a|s)}\big[r(s,a) + \gamma\,\mathbb{E}_{s'\sim p(s'|s,a)}V_i(s')\big]$ is an average of the terms inside the maximum.

Then, at the next state $s'$, we also act with the improved policy $\pi_{i+1}$ (switching back to $\pi_i$ only afterwards), which changes the value function into

$$V^{(2)}_{i+1}(s) = \max_a\Big[r(s,a) + \gamma\,\mathbb{E}_{s'\sim p(s'|s,a)}\big[r\big(s',\pi_{i+1}(s')\big) + \gamma\,\mathbb{E}_{s''\sim p(s''|s',\pi_{i+1}(s'))}V_i(s'')\big]\Big] \ge V^{(1)}_{i+1}(s) \ge V_i(s). \tag{3}$$

Continuing this argument, if we keep using the improved policy $\pi_{i+1}$ until the episode ends, we obtain its value function $V_{i+1}$, which satisfies the following chain of inequalities:

$$V_{i+1}(s) = V^{(m)}_{i+1}(s) \ge V^{(m-1)}_{i+1}(s) \ge \cdots \ge V^{(1)}_{i+1}(s) \ge V_i(s). \tag{4}$$

Thus, we conclude that the improved policy performs better than, or at least no worse than, the previous one.

  • First consider the following policy: at the current decision step we use the new policy $\pi_{i+1}$ to reach a new state $s'$, and from there we follow the old policy $\pi_i$. The resulting value function $V^{(1)}_{i+1}(s)$ is given by Equation (2) above.
  • Next, we use the new policy $\pi_{i+1}$ for two steps, i.e., we also act with $\pi_{i+1}$ at the new state $s'$. The resulting value function $V^{(2)}_{i+1}(s)$ is given by Equation (3), and satisfies $V^{(2)}_{i+1}(s) \ge V^{(1)}_{i+1}(s)$.
  • Continuing to use the new policy $\pi_{i+1}$ indefinitely, we obtain the value function $V_{i+1}(s)$ of the new policy $\pi_{i+1}$, and the chain of inequalities in Equation (4) shows that the new value function is greater than or equal to the old one, which completes the argument. A numerical check of this monotonicity is sketched below.
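The following sketch (again with an assumed random MDP) evaluates an arbitrary deterministic policy $\pi_i$, performs one greedy improvement, and verifies $V_{i+1}(s) \ge V_i(s)$ for every state.

```python
# Numerical illustration of Section 2.1: one greedy improvement step never
# decreases the value function state-wise (random illustrative MDP).
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)   # p(s'|s,a)
R = rng.random((S, A))                                     # r(s,a)

def evaluate(pi, tol=1e-10):
    """V^pi via fixed-point iteration of the Bellman operator B^pi."""
    V = np.zeros(S)
    while True:
        V_new = R[np.arange(S), pi] + gamma * P[np.arange(S), pi] @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

pi_old = rng.integers(A, size=S)                     # arbitrary deterministic pi_i
V_old = evaluate(pi_old)                             # V_i
pi_new = (R + gamma * P @ V_old).argmax(axis=1)      # greedy improvement -> pi_{i+1}
V_new = evaluate(pi_new)                             # V_{i+1}
assert np.all(V_new >= V_old - 1e-9)                 # the inequality of Eq. (4)
print("V_{i+1}(s) >= V_i(s) for every state s")
```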

2.2 Policy Improvement will converge to the optimal policy

This part is actually very simple: since the state space and action space are both discrete and finite, there are only finitely many (deterministic) policies, so the iteration must eventually converge to the optimal policy. The only way it could fail is "policy oscillation": we get policy $\pi_a$ and its value function $V_a$, and the improved policy based on $V_a$ is $\pi_b$; we then get $\pi_b$ and its value function $V_b$, and the improved policy based on $V_b$ turns back into $\pi_a$ (the oscillation could also involve more policies $\pi_a, \pi_b, \pi_c, \dots$), and so on forever.

However, policy oscillation contradicts the guarantee from Section 2.1 that Policy Improvement always yields a better (or no worse) policy. If the improved policy were strictly better, we would get $V_a < V_b < V_a$, which is impossible. If the improved policies are equally good, then the policies in the cycle are already optimal, so we have obtained the optimal policy. A small numerical illustration of this finite-termination argument follows.
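As a sketch of the counting argument: with $|S|$ states and $|A|$ actions there are at most $|A|^{|S|}$ deterministic policies, so monotone improvement forces the policy to stop changing within that many iterations (in practice far sooner). The MDP below is again a random illustrative example, not from this post.

```python
# Check that Policy Iteration terminates within |A|**|S| improvement steps
# (the number of deterministic policies) on a random illustrative MDP.
import numpy as np

rng = np.random.default_rng(3)
S, A, gamma = 4, 3, 0.9
P = rng.random((S, A, S)); P /= P.sum(-1, keepdims=True)   # p(s'|s,a)
R = rng.random((S, A))                                     # r(s,a)

def evaluate(pi, tol=1e-10):
    V = np.zeros(S)
    while True:
        V_new = R[np.arange(S), pi] + gamma * P[np.arange(S), pi] @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

pi, n_policies = rng.integers(A, size=S), A ** S     # at most |A|**|S| distinct policies
for k in range(n_policies + 1):
    pi_new = (R + gamma * P @ evaluate(pi)).argmax(axis=1)
    if np.array_equal(pi_new, pi):                   # policy stable -> optimal
        break
    pi = pi_new
print(f"terminated after {k} improvement steps (upper bound: {n_policies})")
```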

Author: MoonOut

Original link: https://www.cnblogs.com/moonout/p/17804874.html

License: this work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 2.5 China Mainland license (CC BY-NC-ND 2.5 CN).
