Q-learning: A Reinforcement Learning Solution to the Towers of Hanoi Puzzle
Our goal is to write Q-learning code and then use it to solve the Towers of Hanoi puzzle.
A Brief Introduction to Reinforcement Learning
We will skip the basic formal definitions here and go straight to the parts that are useful for this problem.
The steps of reinforcement learning:

- For each state, and for every action available in that state, estimate the potential reward of that state-action pair.
- These estimates are usually recorded in a Q table, which can be written as \(Q[(state, move): value]\).
- For the Towers of Hanoi puzzle the goal is always reachable, so we set the final reinforcement \(r\) = 1.
- There are two strategies for selecting an action (note: each strategy uses a different Q-table update equation):
  - Option 1: always choose the smallest value, where a smaller value means being closer to the goal.
  - Option 2: always choose the largest value, where a larger value means being closer to the goal.
  - Here we set the reinforcement to 1 and choose actions by the smaller value, as in the short sketch below. The selection rule is
    \[ a_t = \mathop{\arg\min}_{a} Q(s_t,a), \]
    where \(a_t\) is the chosen action and \(s_t\) is the current state: among the actions \(a\) available in \(s_t\), pick the one with the smallest Q value.
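To make the selection rule concrete, here is a minimal sketch (the Q-table contents and the state/move names below are made up purely for illustration) of arg-min selection over a dictionary-backed Q table:

import numpy as np

# Hypothetical Q table: keys are (state, move) pairs, values are estimated steps to the goal.
Q = {('s0', 'a1'): 0.8, ('s0', 'a2'): 0.3, ('s0', 'a3'): 0.5}

def greedyMove(Q, state, moves):
    # Pick the move with the smallest Q value; unseen pairs default to 0.
    values = np.array([Q.get((state, m), 0) for m in moves])
    return moves[np.argmin(values)]

print(greedyMove(Q, 's0', ['a1', 'a2', 'a3']))   # -> a2 (smallest value, 0.3)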
Now consider how the Q table is updated.

- We use the following two update equations (with r = 1). (Note: all Q values are initialized to 0 and then updated as the state-action pairs are visited.)
- If the goal is reached:
  \[ \begin{align*} Q(s_t,a_t) = Q(s_t,a_t) + \rho (r - Q(s_t,a_t)) \end{align*} \]
  Alternatively, the value can simply be set to 1 to mark that the goal has been reached; for simplicity we assign 1 directly here.
- Otherwise:
  \[ \begin{align*} Q(s_t,a_t) = Q(s_t,a_t) + \rho (r + Q(s_{t+1},a_{t+1}) - Q(s_t,a_t)) \end{align*} \]
- Understanding the equations above:
  - First, in state \(s_t\) we choose action \(a_t\) according to the Q table and move to \(s_{t+1}\).
  - In \(s_{t+1}\), the first thing we do is update the Q value of the previous pair \((s_t, a_t)\).
  - At this point the Q table gives us the value of \((s_{t+1}, a_{t+1})\), and we have the reinforcement \(r\) = 1.
  - We treat \((r + Q(s_{t+1},a_{t+1}))\) as the "actual" value of action \(a_t\) in state \(s_t\), while \(Q(s_t,a_t)\) is the current estimate.
  - Therefore \((r + Q(s_{t+1},a_{t+1}) - Q(s_t,a_t))\) is the difference between the actual and estimated values; multiplying it by the learning rate \(\rho\) gives the amount learned in each update.
  - Finally, adding this difference to the old estimate \(Q(s_t,a_t)\) yields the updated \(Q(s_t,a_t)\).

This concludes the explanation of basic Q-learning; a small numeric sketch of one update step follows.
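A minimal worked example of the update (the values of \(Q(s_t,a_t)\) and \(Q(s_{t+1},a_{t+1})\) below are hypothetical; \(\rho\) = 0.5, r = 1):

rho = 0.5        # learning rate
r = 1            # reinforcement of one per move
Q_current = 2.0  # current estimate Q(s_t, a_t)      (hypothetical value)
Q_next = 1.5     # estimate Q(s_{t+1}, a_{t+1})      (hypothetical value)

# Move the estimate toward the "actual" value r + Q(s_{t+1}, a_{t+1})
Q_current = Q_current + rho * (r + Q_next - Q_current)
print(Q_current)  # 2.0 + 0.5 * (1 + 1.5 - 2.0) = 2.25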
What Needs to Be Done

- First, we visualize the Towers of Hanoi puzzle so that the results and the solution process are easy to inspect.
- In the simplest form, a state can be represented as [[1, 2, 3], [], []]: the three inner lists stand for the three pegs, and the numbers stand for the disks, with larger numbers meaning larger disks.
- A disk-moving action can likewise be reduced to a pair [1, 2] or (1, 2), meaning "move the top disk of peg 1 onto peg 2" (pegs are numbered 1, 2, 3 from left to right).
- We can then write the following four functions:
  - printState(state): print the state of the towers for visualization
  - validMoves(state): return all valid moves for the current state
  - makeMove(state, move): return the state obtained after applying move (the action)
  - stateMoveTuple(state, move): convert state and move (the action) into tuple form, i.e. (state, move), because we store the Q table as a dictionary, which keeps things simple.
- Next, write the epsilonGreedy function (together with its epsilonDecayFactor).
  - Its job: draw a random number; if that number is smaller than the preset epsilon, choose a random move; if it is larger, pick the move with the smallest Q value from the Q table.
  - In the epsilonGreedy function (if np.random.uniform() < epsilon), a small epsilon means moves are more often chosen from the Q table, while too large an epsilon can prevent convergence. For this problem we apply epsilon *= epsilonDecayFactor after each episode so that epsilon steadily shrinks toward 0, as in the short sketch below.
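A quick sketch of how epsilon shrinks under this decay rule (the decay factor 0.7 used here is only an illustrative value, matching the training call further below):

epsilon = 1.0
epsilonDecayFactor = 0.7   # illustrative value
for episode in range(5):
    epsilon *= epsilonDecayFactor
    print('episode %d: epsilon = %.3f' % (episode, epsilon))
# prints 0.700, 0.490, 0.343, 0.240, 0.168 -- exploration fades toward 0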
- trainQ(nRepetitions, learningRate, epsilonDecayFactor, validMovesF, makeMoveF, startState, goalState)
  - Trains the Q table from the given start and goal states and returns a reasonable Q table.
  - Pseudocode for trainQ:
    - Initialize Q.
    - Repeat:
      - Use the epsilonGreedy function to get a move and the resulting stateNew
      - If (state, move) is not in Q, initialize Q[(state, move)] = 0
      - If stateNew is the goalState, set Q[(state, move)] = 1
      - Otherwise (not at the goal), if this is not the first step, update Q[(stateOld, moveOld)] += rho * (1 + Q[(state, move)] - Q[(stateOld, moveOld)])
      - Shift the current state and move into the old ones.
- testQ(Q, maxSteps, validMovesF, makeMoveF, startState, goalState)
  - Given the chosen start and goal states, automatically select the best sequence of moves according to the values in the Q table.
  - Pseudocode for testQ:
    - Get Q from trainQ;
    - Repeat:
      - Use the validMoves function to get the list of candidate moves;
      - Use the Q table to look up the value of \((state, move)\); if a pair is not in Q, treat its value as infinity
      - Choose the move by \(\arg\min Q[(state, move)]\)
      - Record the move and state in path
      - If at the goal, return path
      - If step > maxSteps, return 'Goal not reached in maxSteps'
Code & Test
import numpy as np
import random
import matplotlib.pyplot as plt
import copy
%matplotlib inline
def stateModify(state):
    # Pad each peg with leading zeros so that every peg has N rows,
    # then flatten the pegs row by row for printing.
    N = 3                              # number of disks (rows to display)
    row = []
    flatState = []
    columns = len(state)
    stateCopy = copy.deepcopy(state)   # deep copy so the caller's state is not modified
    for i in range(columns):
        row.append(len(stateCopy[i]))
    # add 0s on top of pegs holding fewer than N disks
    for i in range(columns):
        while row[i] < N:
            stateCopy[i].insert(0, 0)
            row[i] = len(stateCopy[i])
    # flatten row by row, top row first
    for i in range(max(row)):
        for j in range(len(stateCopy)):
            flatState.append(stateCopy[j][i])
    return flatState
def printState(state):
statePrint = stateModify(state)
# print the state
i = 0
for num in statePrint:
# if the number is zero, we print ' '
if num == 0:
print(" ",end=" ")
else:
print(num, end=" ")
i += 1
if i%3 == 0:
print("")
print('------')
def validMoves(state):
actions = []
# check left
if state[0] != []:
# left to middle
if state[1]==[] or state[0][0] < state[1][0]:
actions.append([1,2])
# left to right
if state[2]==[] or state[0][0] < state[2][0]:
actions.append([1,3])
# check middle
if state[1] != []:
# middle to left
if state[0]==[] or state[1][0] < state[0][0]:
actions.append([2,1])
# middle to right
if state[2]==[] or state[1][0] < state[2][0]:
actions.append([2,3])
# check right
if state[2] != []:
# right to left
if state[0]==[] or state[2][0] < state[0][0]:
actions.append([3,1])
# right to middle
if state[1]==[] or state[2][0] < state[1][0]:
actions.append([3,2])
return actions
def stateMoveTuple(state, move):
    # Convert (state, move) into nested tuples so it can be used as a dictionary key.
    stateTuple = []
    returnTuple = [tuple(move)]
    for i in range(len(state)):
        stateTuple.append(tuple(state[i]))
    returnTuple.insert(0, tuple(stateTuple))
    return tuple(returnTuple)
def makeMove(state, move):
    # Return a new state with the top disk of peg move[0] moved onto peg move[1].
    stateMove = copy.deepcopy(state)
    stateMove[move[1]-1].insert(0, stateMove[move[0]-1][0])
    stateMove[move[0]-1].pop(0)
    return stateMove
def epsilonGreedy(Q, state, epsilon, validMovesF):
    validMoveList = validMovesF(state)
    if np.random.uniform() < epsilon:
        # Random Move
        lens = len(validMoveList)
        return validMoveList[random.randint(0, lens-1)]
    else:
        # Greedy Move: unseen (state, move) pairs default to 0, so they still get tried
        Qs = np.array([Q.get(stateMoveTuple(state, m), 0) for m in validMoveList])
        return validMoveList[np.argmin(Qs)]
def trainQ(nRepetitions, learningRate, epsilonDecayFactor, validMovesF, makeMoveF,startState,goalState):
epsilon = 1.0
outcomes = np.zeros(nRepetitions)
Q = {}
for nGames in range(nRepetitions):
epsilon *= epsilonDecayFactor
step = 0
done = False
state = copy.deepcopy(startState)
while not done:
step += 1
move = epsilonGreedy(Q, state, epsilon, validMovesF)
stateNew = makeMoveF(state,move)
if stateMoveTuple(state, move) not in Q:
Q[stateMoveTuple(state, move)] = 0
if stateNew == goalState:
# Q[stateMoveTuple(state, move)] += learningRate * (1 - Q[stateMoveTuple(state, move)])
Q[stateMoveTuple(state, move)] = 1
done = True
outcomes[nGames] = step
else:
                if step > 1:
                    # TD update for the previous pair: a cost of 1 for the move
                    # plus the current estimate of the successor pair
                    Q[stateMoveTuple(stateOld, moveOld)] += learningRate * \
                        (1 + Q[stateMoveTuple(state, move)] - Q[stateMoveTuple(stateOld, moveOld)])
stateOld = copy.deepcopy(state)
moveOld = copy.deepcopy(move)
state = copy.deepcopy(stateNew)
return Q, outcomes
def testQ(Q, maxSteps, validMovesF, makeMoveF, startState, goalState):
    state = copy.deepcopy(startState)
    path = []
    path.append(state)
    done = False
    step = 0
    while not done:
        step += 1
        Qs = []
        validMoveList = validMovesF(state)
        for m in validMoveList:
            if stateMoveTuple(state, m) in Q:
                Qs.append(Q[stateMoveTuple(state, m)])
            else:
                # unseen (state, move) pairs are treated as infinitely costly
                Qs.append(np.inf)
        stateNew = makeMoveF(state, validMoveList[np.argmin(Qs)])
        path.append(stateNew)
        if stateNew == goalState:
            return path
        elif step >= maxSteps:
            print('Goal not reached in {} steps'.format(maxSteps))
            return []
        state = copy.deepcopy(stateNew)
def minsteps(steps, minStepOld, nRepetitions):
    # Count how many episodes had to pass before the remaining episodes
    # average 7 steps or fewer (7 moves is the optimal 3-disk solution).
    delStep = 0
    steps = list(steps)
    while delStep != nRepetitions:
        if np.mean(steps) > 7:
            steps.pop(0)
            delStep += 1
        else:
            break
    # Return the better (smaller) count and a flag saying whether it improved.
    if delStep < minStepOld:
        return delStep, True
    else:
        return minStepOld, False
def findBetter(nRepetitions, learningRate, epsilonDecayFactor):
    # Grid search over candidate learningRate / epsilonDecayFactor values,
    # scored by how quickly training converges to the 7-step solution (see minsteps).
    Q, steps = trainQ(nRepetitions, 0.5, 0.7, validMoves, makeMove,startState = [[1, 2, 3], [], []],goalState = [[], [], [1, 2, 3]])
    minStepOld,_ = minsteps(steps,0xffffff,50)   # 0xffffff serves as a large initial baseline
bestlRate = 0.5
besteFactor = 0.7
LAndE = []
for k in range(10):
for i in range(len(learningRate)):
for j in range(len(epsilonDecayFactor)):
Q, steps = trainQ(nRepetitions, learningRate[i], epsilonDecayFactor[j], validMoves, makeMove,\
startState = [[1, 2, 3], [], []],goalState = [[], [], [1, 2, 3]])
minStepNew,B = minsteps(steps,minStepOld,nRepetitions)
if B:
bestlRate = learningRate[i]
besteFactor = epsilonDecayFactor[j]
minStepOld = copy.deepcopy(minStepNew)
LAndE.append([bestlRate,besteFactor])
return LAndE
Test part
state = [[1, 2, 3], [], []]
printState(state)
1
2
3
------
state = [[1, 2, 3], [], []]
move =[1, 2]
stateMoveTuple(state, move)
(((1, 2, 3), (), ()), (1, 2))
state = [[1, 2, 3], [], []]
newstate = makeMove(state, move)
newstate
[[2, 3], [1], []]
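As an extra quick check (not in the original test sequence), the valid moves from the start state can be listed; the expected result follows directly from the rules encoded in validMoves:

state = [[1, 2, 3], [], []]
validMoves(state)   # expected: [[1, 2], [1, 3]] -- only the top disk of peg 1 can move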
Q, stepsToGoal = trainQ(100, 0.5, 0.7, validMoves, makeMove,startState = [[1, 2, 3], [], []],goalState = [[], [], [1, 2, 3]])
path = testQ(Q, 20, validMoves, makeMove,startState = [[1, 2, 3], [], []],goalState = [[], [], [1, 2, 3]])
path
[[[1, 2, 3], [], []],
[[2, 3], [], [1]],
[[3], [2], [1]],
[[3], [1, 2], []],
[[], [1, 2], [3]],
[[1], [2], [3]],
[[1], [], [2, 3]],
[[], [], [1, 2, 3]]]
for s in path:
printState(s)
print()
1
2
3
------
2
3 1
------
3 2 1
------
1
3 2
------
1
2 3
------
1 2 3
------
2
1 3
------
1
2
3
------
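matplotlib is imported above but not otherwise used; the following sketch (reusing the stepsToGoal array returned by the trainQ call above) visualizes how the number of moves per episode falls toward the optimal 7:

plt.plot(stepsToGoal)
plt.axhline(7, color='r', linestyle='--')   # 7 moves is optimal for 3 disks
plt.xlabel('Training episode')
plt.ylabel('Steps to goal')
plt.show()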
# find better learningRate and epsilonDecayFactor
learningRate = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
epsilonDecayFactor = [0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9]
LAndE = findBetter(100,learningRate,epsilonDecayFactor)
print(LAndE)
[[0.9, 0.2], [0.9, 0.2], [0.9, 0.2], [0.9, 0.2], [0.9, 0.6], [0.9, 0.6], [0.9, 0.6], [0.9, 0.6], [0.9, 0.6], [0.9, 0.6]]
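As a follow-up sketch, one of the parameter pairs printed above (here learningRate = 0.9 and epsilonDecayFactor = 0.6 are picked from that list; they are not guaranteed to be the best choice) can be plugged back into trainQ and checked with testQ:

Q, stepsToGoal = trainQ(100, 0.9, 0.6, validMoves, makeMove,
                        startState=[[1, 2, 3], [], []], goalState=[[], [], [1, 2, 3]])
path = testQ(Q, 20, validMoves, makeMove,
             startState=[[1, 2, 3], [], []], goalState=[[], [], [1, 2, 3]])
print(len(path) - 1)   # number of moves taken; 7 is optimal (an empty path means no convergence)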