RL 基础 | Value Iteration 的收敛性证明
(其实是专业课作业🤣 感觉算法岗面试可能会问,来存一下档)
相关链接:RL 基础 | Policy Iteration 的收敛性证明
问题:证明 Value Iteration 收敛性
Please prove the convergence of Value Iteration algorithm in discounted reward setting, and analyze the performance between resulting policy and optimal policy. 证明 Value Iteration 的收敛性。
背景:
- Policy Iteration:先得到一个 policy,然后算出它的 value function,基于该 value function 取 max_a Q(s,a) 作为新 policy,再算新 policy 的 value function…
- Value Iteration:不停对 value function 使用贝尔曼算子,新 V = max_a [r(s,a) + γV(s')]。
0 Definitions - 定义
First, we define
- Discrete state space \(S=\{s_1,\cdots,s_n\}\), discrete action space \(A=\{a_1,\cdots,a_n\}\).
- Value function: \(V(s)=E_{a\sim\pi(a|s)}\big[r(s,a) + \gamma E_{s'\sim p(s'|s,a)}V(s')\big]\) .
- Bellman operator: for all \(s\in S\) ,
The Value Iteration algorithm is to conduct Bellman operator \(B_*\) on the value function \(V(s)\). We want to prove that the sequence will converge to the optimal value function \(V^*\), which should be the of the Bellman equation \(B_*V=V\) .
我们定义了状态空间、动作空间、值函数(value function)和贝尔曼算子(Bellman operator)。VI 算法,就是对 value function 一直使用 Bellman operator,希望最后能收敛到最优的 value function。这样,该最优 value function 必然是 Bellman equation \(B_*V=V\) 的不动点。
1 Bellman operator is a Contraction Mapping - 贝尔曼算子是压缩映射
Then, we need to prove that Bellman Operator is a Contraction Mapping. The definition of Contraction Mapping is as follows:
where \(d\) is a measurement over a metric space, \(\gamma\in(0,1)\), and \(f\) is the Contraction Mapping. In our case, the definition of Contraction Mapping is as follows:
where \(||_\infty\) is the infinite norm (return the max value over all dimensions of a vector).
接下来,证明贝尔曼算子是压缩映射(Contraction Mapping)。在这里,压缩映射的定义是,两个变量的无穷范数(向量各维之差的最大绝对值)会被 \(f\) 映射压缩 \(\gamma\) 倍。
Let's prove that Bellman Operator \(B_*\) is a Contraction Mapping. The proof is as follows.
贝尔曼算子是压缩映射的证明:一连串放缩,如上。
2 Contraction Mapping converges to the optimal value function - 压缩映射能收敛到最优值函数
If Bellman Operator is a Contraction Mapping, then we can use the Banach fixed point theorem to prove that, the optimal value function \(V^*\) is the unique fixed point of the Bellman Operator \(B_*\), and the sequence \(B_*(B_*(\cdots B_*V))\) converges to \(V^*\).
如果贝尔曼算子是压缩映射,则可以用巴拿赫不动点定理(Banach fixed point theorem),证明 \(V^*\) 是 Bellman Equation \(B_*V=V\) 的唯一不动点,并且一定会收敛到该不动点。
The first step is to prove that the fixed point \(V^*\) is unique. Use proof by contradiction. Suppose there are two converged value \(V_1^*,V_2^*\), then we have \(|B_*V_1^*-B_*V_2^*|_\infty=|V_1^*-V_2^*|_\infty\). However, Bellman operator \(B_*\) is a Contraction Mapping, so there is \(|B_*V_1^*-B_*V_2^*|_\infty\le\gamma|V_1^*-V_2^*|_\infty\), which contradicts the previous one. Thus, there can be only one converged value, and it is obviously the fixed point \(V^*\).
第一步,证明收敛到的不动点是唯一的。使用反证法,假设有两个不动点,那么它们俩的压缩映射都映射到原地(不动点嘛),距离没有 γ 倍压缩,矛盾,从而得证。
The second step is to prove that the fixed point \(V^*\) really exists. If we can prove that \(\{V,B_*V,B_*^2V,\cdots\}\) is a Cauchy sequence, the fixed point \(V^*\) exists. Take two value \(B_*^a,B_*^b\) from the sequence, where \(a\gg b\). Use the triangle inequality of \(||_\infty\) as a measurement, we can get
So, if we can choose a large enough \(b\), the value of \(|B_*^a-B_*^b|_\infty\) can be less than any \(\epsilon\gt0\). Thus, there must exists an \(N\in\mathbb N\), such that \(|B_*^a-B_*^b|_\infty\lt\epsilon\), for all \(a,b\gt N\) . The sequence is a Cauchy sequence, and the fixed point \(V^*\) exists.
第二步,证明不动点 \(V^*\) 存在。去证 \(\{V,B_*V,B_*^2V,\cdots\}\) 是柯西序列(Cauchy sequence),如果是柯西序列,那么就存在收敛的不动点。柯西序列的定义是,对于序列里的 \(x_a,x_b\),存在正整数 N 使得对于任意 \(a,b\gt N\),度量 \(d(x_a,x_b)\lt\epsilon\),其中 \(\epsilon\) 是任意小的正实数。先使用三角不等式,将 \(x_a-x_b\) 的形式拆成 \((x_a-x_{a-1})+(x_{a-1}-x_{a-2})+\cdots+(x_{b+1}-x_b)\),然后使用压缩映射的性质,放缩成 γ 的等比序列求和,即可得到,它小于任意的小实数。
Reference - 参考资料
- A blog from Zhihu: https://zhuanlan.zhihu.com/p/419208786
- 非常感谢!很清晰的博客 😊🌹