Estimating from the Q-table

 

                a1 (value of choice 1)    a2 (value of choice 2)
s1 (state 1)    -2                        1
s2 (state 2)    -4                        2
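Reading an estimate off the Q-table amounts to looking up the row for the current state and taking the column with the largest value. The sketch below assumes the rows are states (s1, s2) and the columns are actions (a1, a2); the dictionary layout and the helper name `greedy_action` are illustrative choices, not part of the original post.

```python
# Q-table from the text: rows are s1, s2; columns are a1, a2.
q_table = {
    "s1": {"a1": -2, "a2": 1},
    "s2": {"a1": -4, "a2": 2},
}

def greedy_action(q, state):
    """Return the action with the largest Q-value in the given state."""
    return max(q[state], key=q[state].get)

for state in q_table:
    print(state, "->", greedy_action(q_table, state))
```

With these numbers a2 has the higher value in both states, so a purely greedy agent would pick a2 in s1 and in s2.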

 

Q-Learning Algorithm:

# The following is pseudocode

Initialize Q(s, a) arbitrarily
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        Choose a from s using policy derived from Q (e.g., ε-greedy)
        Take action a, observe r, s'
        Q(s, a) <- Q(s, a) + α[r + γ max_a' Q(s', a') - Q(s, a)]
        s <- s'
    until s is terminal
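The pseudocode above can be turned into a small runnable sketch. The environment here is hypothetical and chosen only for illustration: a corridor of 5 cells where the agent starts at the left end and receives a reward of 1 on reaching the right end. The values of `alpha`, `gamma`, `epsilon`, the episode count, and all function names are assumptions, not taken from the original post.

```python
import random

# Hypothetical toy environment (not from the original post): a corridor of
# 5 cells. The agent starts in cell 0; reaching cell 4 ends the episode.
N_STATES = 5
ACTIONS = [0, 1]          # 0 = move left, 1 = move right
TERMINAL = N_STATES - 1

def step(s, a):
    """Return (reward, next_state) for taking action a in state s."""
    s_next = max(0, s - 1) if a == 0 else s + 1
    r = 1.0 if s_next == TERMINAL else 0.0
    return r, s_next

def epsilon_greedy(Q, s, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[s])
    # Break ties between equally good actions at random.
    return rng.choice([a for a in ACTIONS if Q[s][a] == best])

def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    # Initialize Q(s, a); zeros are one valid "arbitrary" choice.
    Q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0                                  # Initialize s
        while s != TERMINAL:                   # until s is terminal
            a = epsilon_greedy(Q, s, epsilon, rng)
            r, s_next = step(s, a)
            # Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next
    return Q

Q = q_learning()
# Greedy policy for every non-terminal state.
policy = [max(ACTIONS, key=lambda a: Q[s][a]) for s in range(TERMINAL)]
print(policy)
```

The terminal row of `Q` is never updated and stays at zero, so the target for the last step reduces to the reward alone. After training, the greedy policy should prefer "move right" in every non-terminal cell, since that is the only way to reach the reward.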

 

Recurrence relation
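One way to write the update step of the pseudocode above as a recurrence on Q is given below. The time index t is added here only for illustration; α is the learning rate (the step-size factor in front of the bracket in the update line) and γ is the discount factor.

```latex
Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t)
  + \alpha \left[ r_{t+1} + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t) \right]
```

Each new estimate is the old estimate moved a fraction α of the way toward the target r + γ max Q(s', a'), which is why the rule can be applied step by step as the agent interacts with the environment.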

posted on 2017-11-25 19:56 by 历久弥坚0820