RL 基础 | Value Iteration 的收敛性证明


(其实是专业课作业🤣 感觉算法岗面试可能会问,来存一下档)

相关链接:RL 基础 | Policy Iteration 的收敛性证明


问题:证明 Value Iteration 收敛性

Please prove the convergence of Value Iteration algorithm in discounted reward setting, and analyze the performance between resulting policy and optimal policy. 证明 Value Iteration 的收敛性。

背景:

  • Policy Iteration:先得到一个 policy,然后算出它的 value function,基于该 value function 取 max_a Q(s,a) 作为新 policy,再算新 policy 的 value function…
  • Value Iteration:不停对 value function 使用贝尔曼算子,新 V = max_a [r(s,a) + γV(s')]。

0 Definitions - 定义

First, we define

  • Discrete state space S={s1,,sn}, discrete action space A={a1,,an}.
  • Value function: V(s)=Eaπ(a|s)[r(s,a)+γEsp(s|s,a)V(s)] .
  • Bellman operator: for all sS ,

BπV(s)Eaπ(a|s)[r(s,a)+γEsp(s|s,a)V(s)]BV(s)maxa[r(s,a)+γEsp(s|s,a)V(s)]

The Value Iteration algorithm is to conduct Bellman operator B on the value function V(s). We want to prove that the sequence will converge to the optimal value function V, which should be the of the Bellman equation BV=V .

我们定义了状态空间、动作空间、值函数(value function)和贝尔曼算子(Bellman operator)。VI 算法,就是对 value function 一直使用 Bellman operator,希望最后能收敛到最优的 value function。这样,该最优 value function 必然是 Bellman equation BV=V 的不动点。

1 Bellman operator is a Contraction Mapping - 贝尔曼算子是压缩映射

Then, we need to prove that Bellman Operator is a Contraction Mapping. The definition of Contraction Mapping is as follows:

d(f(x1),f(x2))γd(x1,x2),

where d is a measurement over a metric space, γ(0,1), and f is the Contraction Mapping. In our case, the definition of Contraction Mapping is as follows:

|f(x1)f(x2)|γ|x1x2|,

where || is the infinite norm (return the max value over all dimensions of a vector).

接下来,证明贝尔曼算子是压缩映射(Contraction Mapping)。在这里,压缩映射的定义是,两个变量的无穷范数(向量各维之差的最大绝对值)会被 f 映射压缩 γ 倍。

Let's prove that Bellman Operator B is a Contraction Mapping. The proof is as follows.

|BV1(s)BV2(s)|=|maxa[r(s,a)+γEs p(s|s,a)V1(s)]maxa[r(s,a)+γEsV2(s)]||maxa[r(s,a)+γEsV1(s)r(s,a)γEsV2(s)]|=|γmaxaEs(V1(s)V2(s))|γmaxa|Es(V1(s)V2(s))|=γmaxa|sp(s|s,a)[V1(s)V2(s)]|γmaxasp(s|s,a)|V1(s)V2(s)|=γsp(s|s,a(s))|V1(s)V2(s)|γsp(s|s,a(s))maxs|V1(s)V2(s)|=γ|V1V2|

贝尔曼算子是压缩映射的证明:一连串放缩,如上。

2 Contraction Mapping converges to the optimal value function - 压缩映射能收敛到最优值函数

If Bellman Operator is a Contraction Mapping, then we can use the Banach fixed point theorem to prove that, the optimal value function V is the unique fixed point of the Bellman Operator B, and the sequence B(B(BV)) converges to V.

如果贝尔曼算子是压缩映射,则可以用巴拿赫不动点定理(Banach fixed point theorem),证明 V 是 Bellman Equation BV=V 的唯一不动点,并且一定会收敛到该不动点。

The first step is to prove that the fixed point V is unique. Use proof by contradiction. Suppose there are two converged value V1,V2, then we have |BV1BV2|=|V1V2|. However, Bellman operator B is a Contraction Mapping, so there is |BV1BV2|γ|V1V2|, which contradicts the previous one. Thus, there can be only one converged value, and it is obviously the fixed point V.

第一步,证明收敛到的不动点是唯一的。使用反证法,假设有两个不动点,那么它们俩的压缩映射都映射到原地(不动点嘛),距离没有 γ 倍压缩,矛盾,从而得证。

The second step is to prove that the fixed point V really exists. If we can prove that {V,BV,B2V,} is a Cauchy sequence, the fixed point V exists. Take two value Ba,Bb from the sequence, where ab. Use the triangle inequality of || as a measurement, we can get

|BaBb||BaBa1|+|Ba1Bb||BaBa1|+|Ba1Ba2|+|Ba2Bb||BaBa1|++|Bb+1Bb|γb|BVV|i=0ab1γiγb1γ|BVV|.

So, if we can choose a large enough b, the value of |BaBb| can be less than any ϵ>0. Thus, there must exists an NN, such that |BaBb|<ϵ, for all a,b>N . The sequence is a Cauchy sequence, and the fixed point V exists.

第二步,证明不动点 V 存在。去证 {V,BV,B2V,} 是柯西序列(Cauchy sequence),如果是柯西序列,那么就存在收敛的不动点。柯西序列的定义是,对于序列里的 xa,xb,存在正整数 N 使得对于任意 a,b>N,度量 d(xa,xb)<ϵ,其中 ϵ 是任意小的正实数。先使用三角不等式,将 xaxb 的形式拆成 (xaxa1)+(xa1xa2)++(xb+1xb),然后使用压缩映射的性质,放缩成 γ 的等比序列求和,即可得到,它小于任意的小实数。

Reference - 参考资料



本文作者:MoonOut

本文链接:https://www.cnblogs.com/moonout/p/17783506.html

版权声明:本作品采用知识共享署名-非商业性使用-禁止演绎 2.5 中国大陆许可协议进行许可。

posted @   MoonOut  阅读(720)  评论(0编辑  收藏  举报
历史上的今天:
2021-10-24 刷题记录:LC1997-访问完所有房间的第一天
点击右上角即可分享
微信分享提示
评论
收藏
关注
推荐
深色
回顶
收起
  1. 1 Sibelius: Violin Concerto in D Minor, Op. 47:III. Allegro, ma non tanto Jascha Heifetz / Chicago Symphony Orchestra
Sibelius: Violin Concerto in D Minor, Op. 47:III. Allegro, ma non tanto - Jascha Heifetz / Chicago Symphony Orchestra
00:00 / 00:00
An audio error has occurred.