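Reinforcement Learning Code: The 4×4 Gridworld Problem
Preface
This post mainly follows the example in the book "Reinforcement Learning: An Introduction, Second Edition".
English edition: http://incompleteideas.net/book/first/ebook/the-book.html
For the source files, see this post (implemented in Matlab): https://zhuanlan.zhihu.com/p/79701922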
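Problem Description
The task is policy evaluation: computing the state-value function under the equiprobable random policy.
Problem statement: an agent moves around a 4×4 grid. The nonterminal states are {1, 2, ..., 14}; the terminal states are the two cells in the top-left and bottom-right corners (marked ** in the grid below). In every nonterminal state the available actions are up, down, left, and right. An action that would take the agent off the grid leaves it where it is, and the episode ends once a terminal state is reached. Every transition, at every time step, yields an immediate reward of -1.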
** | 1 | 2 | 3 |
---|---|---|---|
4 | 5 | 6 | 7 |
8 | 9 | 10 | 11 |
12 | 13 | 14 | ** |
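
Problem assumptions: the game model is simple and fully specified; no further assumptions are needed.
Problem analysis: states (16 cells, two of which are terminal states whose values are never updated, while all the others become negative), actions (up, down, left, right; at the border the position stays unchanged, otherwise the row or column index changes by one), and the transition rule.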
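The transition rule described above can be sketched as a small function. This is only an illustration: the names `GRID_SIZE`, `MOVES`, `is_terminal`, and `step` are hypothetical and do not appear in the original listing, and returning a reward of 0 from a terminal state is an assumption made so the function is defined everywhere.

```python
# Hypothetical helper names; a sketch of the transition rule, not the original code.
GRID_SIZE = 4
MOVES = {'U': (-1, 0), 'D': (1, 0), 'L': (0, -1), 'R': (0, 1)}


def is_terminal(state):
    """The top-left and bottom-right cells are the terminal states."""
    return state in ((0, 0), (GRID_SIZE - 1, GRID_SIZE - 1))


def step(state, action):
    """Return (next_state, reward) for one move on the grid."""
    if is_terminal(state):
        # Assumption: the episode is over, so nothing moves and no reward is given.
        return state, 0.0
    row, col = state
    d_row, d_col = MOVES[action]
    new_row, new_col = row + d_row, col + d_col
    # A move that would leave the grid keeps the agent where it is.
    if not (0 <= new_row < GRID_SIZE and 0 <= new_col < GRID_SIZE):
        new_row, new_col = row, col
    return (new_row, new_col), -1.0


print(step((0, 1), 'U'))  # bumps into the top wall
print(step((0, 1), 'L'))  # moves into the top-left terminal cell
```

States are written here as (row, column) pairs rather than the numbers 1 to 14, which matches how the full listing indexes the grid.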
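Code Implementation
Iterative policy evaluation generally comes in two forms: the "two-array" version and the "in-place" version. Both are applications of the Bellman equation. The main difference is that the two-array version computes every new value in a sweep from the previous sweep's values before moving on, whereas the in-place version overwrites each value immediately, so later updates in the same sweep already use the values that were just computed. In other words, the difference is whether one copy or two copies of the value matrix \(V\) are stored. In-place updating usually converges faster than the two-array version. The listing in this post uses the two-array form (`world` and `newWorld`).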
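To make the difference concrete, here is a minimal sketch that runs both variants on the same 4×4 grid and counts the sweeps each needs. The function names `backup` and `evaluate`, the parameter `theta`, and the stopping rule (largest single-state change below `theta`) are assumptions made for this illustration; the full listing stops on the summed absolute change instead.

```python
import numpy as np

N = 4
TERMINALS = {(0, 0), (N - 1, N - 1)}
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def backup(values, row, col):
    """One Bellman expectation backup under the equiprobable random policy."""
    total = 0.0
    for d_row, d_col in MOVES:
        new_row, new_col = row + d_row, col + d_col
        if not (0 <= new_row < N and 0 <= new_col < N):
            new_row, new_col = row, col  # off-grid moves leave the state unchanged
        total += 0.25 * (-1.0 + values[new_row, new_col])
    return total


def evaluate(in_place, theta=1e-4):
    """Return (values, sweeps); theta is an assumed convergence threshold."""
    values = np.zeros((N, N))
    sweeps = 0
    while True:
        # In-place: read from the array being written.
        # Two-array: read from a frozen copy of the previous sweep.
        source = values if in_place else values.copy()
        delta = 0.0
        for row in range(N):
            for col in range(N):
                if (row, col) in TERMINALS:
                    continue  # terminal values stay at 0
                new_value = backup(source, row, col)
                delta = max(delta, abs(new_value - values[row, col]))
                values[row, col] = new_value
        sweeps += 1
        if delta < theta:
            return values, sweeps


values_sync, sweeps_sync = evaluate(in_place=False)
values_inplace, sweeps_inplace = evaluate(in_place=True)
print('two-array sweeps:', sweeps_sync)
print('in-place sweeps:', sweeps_inplace)
print(np.round(values_inplace, 1))
```

Both variants should settle on the same values up to the tolerance; only the number of sweeps differs.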
The full listing below is very similar to the earlier Gridworld example, and even a little simpler.
#######################################################################
# Copyright (C) #
# 2016 Shangtong Zhang(zhangshangtong.cpp@gmail.com) #
# 2016 Kenta Shimada(hyperkentakun@gmail.com) #
# Permission given to modify the code as long as you keep this #
# declaration at the top #
#######################################################################
from __future__ import print_function
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.table import Table
WORLD_SIZE = 4      # grid size
REWARD = -1.0       # reward for every transition
ACTION_PROB = 0.25  # probability of choosing each action
world = np.zeros((WORLD_SIZE, WORLD_SIZE))
# action set: left, up, right, down
actions = ['L', 'U', 'R', 'D']
# define the state transitions
nextState = []
for i in range(0, WORLD_SIZE):
    nextState.append([])
    for j in range(0, WORLD_SIZE):
        # maps each action to the resulting [row, col];
        # a move off the grid leaves the position unchanged
        transitions = dict()
        if i == 0:
            transitions['U'] = [i, j]
        else:
            transitions['U'] = [i - 1, j]
        if i == WORLD_SIZE - 1:
            transitions['D'] = [i, j]
        else:
            transitions['D'] = [i + 1, j]
        if j == 0:
            transitions['L'] = [i, j]
        else:
            transitions['L'] = [i, j - 1]
        if j == WORLD_SIZE - 1:
            transitions['R'] = [i, j]
        else:
            transitions['R'] = [i, j + 1]
        nextState[i].append(transitions)
# state space: every cell except the two terminal corners
states = []
for i in range(0, WORLD_SIZE):
    for j in range(0, WORLD_SIZE):
        if (i == 0 and j == 0) or (i == WORLD_SIZE - 1 and j == WORLD_SIZE - 1):
            continue
        else:
            states.append([i, j])
# function that draws the value table
def draw_image(image):
    fig, ax = plt.subplots()
    ax.set_axis_off()
    tb = Table(ax, bbox=[0,0,1,1])
    nrows, ncols = image.shape
    width, height = 1.0 / ncols, 1.0 / nrows
    # Add cells
    for (i,j), val in np.ndenumerate(image):
        color = 'white'
        tb.add_cell(i, j, width, height, text=val,
                    loc='center', facecolor=color)
    # Row Labels...
    for i, label in enumerate(range(len(image))):
        tb.add_cell(i, -1, width, height, text=label+1, loc='right',
                    edgecolor='none', facecolor='none')
    # Column Labels...
    for j, label in enumerate(range(len(image))):
        tb.add_cell(-1, j, width, height/2, text=label+1, loc='center',
                    edgecolor='none', facecolor='none')
    ax.add_table(tb)
    plt.show()
# for figure 4.1
while True:
    # keep iterating until convergence
    newWorld = np.zeros((WORLD_SIZE, WORLD_SIZE))
    for i, j in states:
        for action in actions:
            newPosition = nextState[i][j][action]
            # bellman equation
            newWorld[i, j] += ACTION_PROB * (REWARD + world[newPosition[0], newPosition[1]])
    if np.sum(np.abs(world - newWorld)) < 1e-4:
        print('Random Policy')
        draw_image(np.round(newWorld, decimals=1))
        print(newWorld)
        break
    print(newWorld)
    world = newWorld
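As a cross-check on the iterative result, the same value function can also be obtained by solving the Bellman equations as a linear system, (I - P) v = r, over the 14 nonterminal states. This is a different technique from the iteration in the listing above, shown only as a sanity check; the names `index`, `P`, `r`, `v`, and `grid` are introduced here for illustration.

```python
import numpy as np

N = 4
TERMINALS = {(0, 0), (N - 1, N - 1)}
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

# Enumerate the nonterminal states and give each one a row in the system.
nonterminal = [(r, c) for r in range(N) for c in range(N)
               if (r, c) not in TERMINALS]
index = {state: k for k, state in enumerate(nonterminal)}

# P[k, m] = probability of moving from nonterminal state k to nonterminal state m.
P = np.zeros((len(nonterminal), len(nonterminal)))
for (row, col), k in index.items():
    for d_row, d_col in MOVES:
        new_row, new_col = row + d_row, col + d_col
        if not (0 <= new_row < N and 0 <= new_col < N):
            new_row, new_col = row, col  # off-grid moves leave the state unchanged
        if (new_row, new_col) in index:
            P[k, index[(new_row, new_col)]] += 0.25
        # Moves into a terminal state add nothing: its value is 0.

r = -np.ones(len(nonterminal))  # expected immediate reward is -1 in every state
v = np.linalg.solve(np.eye(len(nonterminal)) - P, r)

# Put the solution back on the 4x4 grid, leaving the terminal cells at 0.
grid = np.zeros((N, N))
for (row, col), k in index.items():
    grid[row, col] = v[k]
print(np.round(grid, 1))
```

The printed grid should agree with the table drawn by `draw_image` up to the iteration's stopping tolerance.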