Reinforcement Learning Theory - Lesson 4 - Value Iteration and Policy Iteration

1. value iteration algorithm:

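Value iteration was already introduced in the previous lesson. Each iteration consists of two steps, a policy update and a value update.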

1.1 policy update:
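
In the standard matrix-vector form, the policy update step picks the greedy policy for the current estimate \(v_k\). The notation here is an assumption: \(r_\pi\) is the expected immediate reward vector under policy \(\pi\), \(P_\pi\) is the state transition matrix under \(\pi\), and \(\gamma \in [0, 1)\) is the discount rate.

\[
\pi_{k+1} = \arg\max_{\pi} \left( r_\pi + \gamma P_\pi v_k \right)
\]

Elementwise, this means that in every state \(s\) the new policy selects an action that maximizes \(q_k(s, a) = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, v_k(s')\).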

1.2 Value update:

At this point, both \(\pi_{k+1}\) and \(v_k\) are known.
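
Because both quantities are known, the value update is a plain substitution rather than an equation to solve. A sketch in the standard matrix-vector form, assuming \(r_{\pi_{k+1}}\) and \(P_{\pi_{k+1}}\) denote the reward vector and transition matrix under \(\pi_{k+1}\), and \(\gamma\) the discount rate:

\[
v_{k+1} = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_k
\]

Since \(\pi_{k+1}\) is greedy with respect to \(v_k\), this is the same as \(v_{k+1}(s) = \max_a q_k(s, a)\) for every state \(s\). Note that \(v_k\) is in general not the state value of any policy; it is only an intermediate estimate.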

1.3 procedure summary:
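
The procedure can be sketched in a few lines of Python. This is a minimal illustration, not the lesson's own code: the array layout (`P[a, s, s']` for transition probabilities, `R[s, a]` for expected rewards), the function name `value_iteration`, the stopping tolerance, and the two-state toy MDP are all assumptions made for the example.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iter=10_000):
    """P: (A, S, S) transition probabilities, R: (S, A) expected rewards."""
    v = np.zeros(R.shape[0])
    for _ in range(max_iter):
        # q_k(s, a) = r(s, a) + gamma * sum_s' p(s'|s, a) * v_k(s')
        q = R + gamma * np.einsum("ast,t->sa", P, v)
        policy = q.argmax(axis=1)  # policy update: greedy w.r.t. q_k
        v_new = q.max(axis=1)      # value update: v_{k+1}(s) = max_a q_k(s, a)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new, policy
        v = v_new
    return v, policy

# Hypothetical toy MDP: 2 states, action 0 = stay, action 1 = switch state.
# Ending up in state 1 yields reward 1, ending up in state 0 yields reward 0.
P = np.zeros((2, 2, 2))
P[0] = np.eye(2)                 # action 0: stay
P[1] = [[0.0, 1.0], [1.0, 0.0]]  # action 1: switch
R = np.array([[0.0, 1.0],
              [1.0, 0.0]])

v_star, pi_star = value_iteration(P, R, gamma=0.9)
```

For this toy problem the optimal behavior is to move to state 1 and stay there, so both state values approach \(1 / (1 - 0.9) = 10\).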

1.4 example:



2. policy iteration algorithm:



Q1:

Q2:

Q3:

2.1 Policy evaluation:
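
Policy evaluation computes the state value of the current policy \(\pi_k\) by solving its Bellman equation. In the usual matrix-vector notation (assumed here: \(r_{\pi_k}\) is the reward vector, \(P_{\pi_k}\) the transition matrix under \(\pi_k\), \(\gamma\) the discount rate):

\[
v_{\pi_k} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}
\]

This can be solved in closed form, or iteratively from any initial guess \(v_{\pi_k}^{(0)}\):

\[
v_{\pi_k} = (I - \gamma P_{\pi_k})^{-1} r_{\pi_k},
\qquad
v_{\pi_k}^{(j+1)} = r_{\pi_k} + \gamma P_{\pi_k} v_{\pi_k}^{(j)}, \quad j = 0, 1, 2, \dots
\]

The iterative form converges to \(v_{\pi_k}\) as \(j \to \infty\), which is why policy evaluation is itself an iterative algorithm nested inside policy iteration.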

2.2 Policy improvement:
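
The two steps, policy evaluation followed by policy improvement, can be sketched as below. This is an illustration under assumptions, not the lesson's own code: the array layout (`P[a, s, s']`, `R[s, a]`), the function names, the exact linear solve used for evaluation, and the two-state toy MDP are all chosen for the example.

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma):
    """Solve v = r_pi + gamma * P_pi v exactly for a deterministic policy."""
    n_states = R.shape[0]
    idx = np.arange(n_states)
    P_pi = P[policy, idx]  # (S, S): row s is p(. | s, policy[s])
    r_pi = R[idx, policy]  # (S,): reward of the chosen action in each state
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

def policy_iteration(P, R, gamma=0.9, max_iter=1000):
    policy = np.zeros(R.shape[0], dtype=int)  # arbitrary initial policy
    for _ in range(max_iter):
        v = policy_evaluation(P, R, policy, gamma)    # policy evaluation
        q = R + gamma * np.einsum("ast,t->sa", P, v)  # q_{pi_k}(s, a)
        new_policy = q.argmax(axis=1)                 # policy improvement
        if np.array_equal(new_policy, policy):
            break  # policy is stable, so it is greedy w.r.t. its own value
        policy = new_policy
    return v, policy

# Hypothetical toy MDP: 2 states, action 0 = stay, action 1 = switch state.
# Ending up in state 1 yields reward 1, ending up in state 0 yields reward 0.
P = np.zeros((2, 2, 2))
P[0] = np.eye(2)
P[1] = [[0.0, 1.0], [1.0, 0.0]]
R = np.array([[0.0, 1.0],
              [1.0, 0.0]])

v_pi, pi = policy_iteration(P, R, gamma=0.9)
```

Starting from the "always stay" policy, the first evaluation gives values of 0 and 10 for the two states; one improvement step then switches state 0 to the "switch" action, after which the policy no longer changes.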


3. truncated policy iteration algorithm

3.1 compare value iteration and policy iteration:




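If the policy evaluation step runs for only one iteration, the algorithm is value iteration; if it runs for infinitely many iterations, the algorithm is policy iteration. Truncating the evaluation after a finite number of iterations in between gives truncated policy iteration.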

3.2 pseudocode:
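
A runnable sketch of truncated policy iteration follows. It is an illustration under assumptions rather than the lesson's own pseudocode: the array layout (`P[a, s, s']`, `R[s, a]`), the parameter name `j_truncate`, the fixed number of outer iterations, and the two-state toy MDP are all made up for the example.

```python
import numpy as np

def truncated_policy_iteration(P, R, gamma=0.9, j_truncate=5, n_iter=500):
    """P: (A, S, S) transition probabilities, R: (S, A) expected rewards."""
    n_states = R.shape[0]
    idx = np.arange(n_states)
    v = np.zeros(n_states)
    for _outer in range(n_iter):
        # Policy improvement: greedy policy for the current value estimate.
        q = R + gamma * np.einsum("ast,t->sa", P, v)
        policy = q.argmax(axis=1)
        # Truncated policy evaluation: only j_truncate sweeps of
        # v <- r_pi + gamma * P_pi v, warm-started from the previous v.
        P_pi, r_pi = P[policy, idx], R[idx, policy]
        for _sweep in range(j_truncate):
            v = r_pi + gamma * (P_pi @ v)
    return v, policy

# Hypothetical toy MDP: 2 states, action 0 = stay, action 1 = switch state.
# Ending up in state 1 yields reward 1, ending up in state 0 yields reward 0.
P = np.zeros((2, 2, 2))
P[0] = np.eye(2)
P[1] = [[0.0, 1.0], [1.0, 0.0]]
R = np.array([[0.0, 1.0],
              [1.0, 0.0]])

v_one, pi_one = truncated_policy_iteration(P, R, j_truncate=1)  # value iteration
v_five, pi_five = truncated_policy_iteration(P, R, j_truncate=5)
```

With `j_truncate=1` each outer iteration is exactly one value iteration update, and as `j_truncate` grows the inner loop approaches a full policy evaluation, so the method approaches policy iteration.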


4. summary:

posted @ 2024-11-13 11:12  penuel