Estimation from the Q-table
| | a1 (value of choice 1) | a2 (value of choice 2) |
|---|---|---|
| s1 (state 1) | -2 | 1 |
| s2 (state 2) | -4 | 2 |
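As a minimal sketch of how such a table is read, the snippet below stores the four values above in a nested dictionary and picks the highest-valued entry in each row. It assumes the rows (s1, s2) are states and the columns (a1, a2) are actions; the names `q_table`, `greedy_action`, and `max_q` are hypothetical and chosen only for illustration.

```python
# Q-table values taken from the table above.
# Assumption: rows (s1, s2) are states, columns (a1, a2) are actions.
q_table = {
    "s1": {"a1": -2, "a2": 1},
    "s2": {"a1": -4, "a2": 2},
}


def greedy_action(q, state):
    """Return the action with the largest Q-value in the given state."""
    return max(q[state], key=q[state].get)


def max_q(q, state):
    """Return the largest Q-value available in the given state."""
    return max(q[state].values())


if __name__ == "__main__":
    for state in q_table:
        print(state, greedy_action(q_table, state), max_q(q_table, state))
```

In both rows a2 has the larger value (1 > -2 and 2 > -4), so a purely greedy choice would pick a2 in either state.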
The Q-Learning algorithm:
# The following is pseudocode
Initialize Q(s, a) arbitrarily
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        Choose a from s using policy derived from Q (e.g., ε-greedy)
        Take action a, observe r, s'
        Q(s, a) <- Q(s, a) + α[r + γ max_a' Q(s', a') - Q(s, a)]
        s <- s'
    until s is terminal
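The pseudocode above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not a reference implementation: the environment is a hypothetical five-state corridor (states 0 to 4, with state 4 terminal and a reward of 1 for reaching it), and the constants `ALPHA` (the step size multiplying the bracket in the update rule), `GAMMA`, and `EPSILON`, as well as the helper names `step`, `choose_action`, and `q_learning`, are made up for this example.

```python
import random

# Hypothetical toy environment: a corridor of states 0..4; state 4 is terminal.
N_STATES = 5
ACTIONS = [0, 1]          # 0 = move left, 1 = move right
ALPHA = 0.1               # learning rate (assumed value)
GAMMA = 0.9               # discount factor (assumed value)
EPSILON = 0.1             # exploration probability (assumed value)


def step(s, a):
    """Take action a in state s; return (reward, next state)."""
    s_next = max(0, s - 1) if a == 0 else s + 1
    r = 1.0 if s_next == N_STATES - 1 else 0.0
    return r, s_next


def choose_action(q, s, rng):
    """ε-greedy: explore with probability EPSILON, else act greedily.

    Ties between equally valued actions are broken at random.
    """
    if rng.random() < EPSILON:
        return rng.choice(ACTIONS)
    best = max(q[s])
    return rng.choice([a for a in ACTIONS if q[s][a] == best])


def q_learning(episodes=500, seed=0):
    """Run tabular Q-learning and return the learned Q-table."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]    # Initialize Q(s, a)
    for _ in range(episodes):                    # for each episode
        s = 0                                    # Initialize s
        while s != N_STATES - 1:                 # until s is terminal
            a = choose_action(q, s, rng)         # Choose a from s
            r, s_next = step(s, a)               # Take a, observe r, s'
            td_target = r + GAMMA * max(q[s_next])
            q[s][a] += ALPHA * (td_target - q[s][a])
            s = s_next                           # s <- s'
    return q


if __name__ == "__main__":
    for s, row in enumerate(q_learning()):
        print(s, [round(v, 3) for v in row])
```

Because the Q-values of the terminal state are never updated, they stay at zero, so the target for the final transition reduces to the reward alone. In this corridor the learned values for moving right should end up higher than those for moving left in every non-terminal state.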
Recurrence relation: