Reinforcement Learning Reading Notes - 06~07 - Temporal-Difference Learning

Reading notes on:
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto © 2014, 2015, 2016

If the mathematical notation is hard to follow, see the notation notes first:

Temporal-Difference Learning in Brief

Temporal-difference (TD) learning combines ideas from dynamic programming and Monte Carlo methods, and is a core idea of reinforcement learning.

The term "temporal difference" is not easy to grasp at first. It helps to read it as "learning from the current difference": the method learns from the difference (error) observed at the current time step.

The Monte Carlo approach simulates (or experiences) an episode and, only after the episode ends, estimates state values from the returns observed along the episode.
Temporal-difference learning also simulates (or experiences) an episode, but after every step (or every few steps) it uses the value of the new state to update the estimate of the preceding state.
The Monte Carlo method can therefore be viewed as temporal-difference learning with the maximum number of steps.
This chapter considers only one-step temporal-difference learning; multi-step temporal-difference learning is covered in the next chapter.

Mathematical Formulation
From what we already know: if we can compute the value of a policy (the state value v_π(s), or the action value q_π(s,a)), we can improve the policy.
In the Monte Carlo method, computing a policy's value requires completing an episode; the state value is then estimated from the episode's return G_t. The formula:

(1)  V(S_t) ← V(S_t) + α δ_t,    δ_t = G_t − V(S_t)
where
  δ_t - Monte Carlo error
  α - learning step size

The idea of temporal difference is to estimate a state's value from the value of the next state, which gives the iterative update formula:

(2)  V(S_t) ← V(S_t) + α δ_t,    δ_t = R_{t+1} + γ V(S_{t+1}) − V(S_t)
where
  δ_t - TD error
  α - learning step size
  γ - reward discount rate

Note: the book points out that the TD error is only an estimate (it bootstraps on the current V(S_{t+1})), whereas the Monte Carlo error uses the actual return and is exact. This is worth knowing, but it is not elaborated further here.
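
To make formulas (1) and (2) concrete, here is a tiny Python sketch of the two update rules; the states, value table and numbers are invented purely for illustration.

alpha, gamma = 0.1, 0.9
V = {"s0": 0.0, "s1": 0.5}               # made-up value table

# Monte Carlo update (1): needs the full return G_t, available only after the episode ends.
G_t = 1.0                                # return observed from s0 until the end of the episode
V["s0"] += alpha * (G_t - V["s0"])       # delta_t = G_t - V(S_t)

# TD(0) update (2): needs only the next reward and the current estimate of the next state.
R = 0.2                                  # reward observed one step after s0
V["s0"] += alpha * (R + gamma * V["s1"] - V["s0"])   # delta_t = R + gamma*V(S') - V(S)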

Temporal-Difference Learning Methods

This chapter introduces the one-step TD learning methods; the multi-step methods are introduced in the next chapter. The methods covered are:

  • TD learning of the state value v_π (one-step / multi-step)
  • On-policy TD learning of the action value q_π: Sarsa (one-step / multi-step)
  • Off-policy TD learning of the action value q_π: Q-learning (one-step)
  • Double Q-learning (one-step)
  • Off-policy TD learning of the action value q_π (with importance sampling): Sarsa (multi-step)
  • Expected Sarsa (one-step)
  • Off-policy TD learning of the action value q_π (without importance sampling): Tree Backup Algorithm (multi-step)
  • Off-policy TD learning of the action value q_π: Q(σ) (multi-step)

TD Learning of the State Value v_π

One-Step TD Learning: TD(0)

  • Flow chart
    [Backup diagram: S --(A, R)--> S'; V(S) is updated from R and V(S')]
  • Algorithm description

Initialize V(s) arbitrarily, ∀ s ∈ S⁺
Repeat (for each episode):
  Initialize S
  Repeat (for each step of episode):
   A ← action given by π for S
   Take action A, observe R, S'
   V(S) ← V(S) + α[R + γ V(S') − V(S)]
   S ← S'
  Until S is terminal
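
Below is a minimal Python sketch of this TD(0) prediction loop. The Gym-style environment interface (env.reset(), env.step(action)) and the policy(state) function are assumptions made for illustration, not something prescribed by the book.

from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
    # One-step TD (TD(0)) prediction of V for a fixed policy.
    # Assumed interfaces: env.reset() -> state, env.step(a) -> (state, reward, done, info),
    # policy(state) -> action.
    V = defaultdict(float)                   # V(s) initialized to 0; terminal states are never bootstrapped on
    for _ in range(num_episodes):
        S = env.reset()
        done = False
        while not done:
            A = policy(S)                    # A <- action given by pi for S
            S_next, R, done, _ = env.step(A) # take action A, observe R, S'
            target = R if done else R + gamma * V[S_next]
            V[S] += alpha * (target - V[S])  # V(S) <- V(S) + alpha*[R + gamma*V(S') - V(S)]
            S = S_next                       # S <- S'
    return V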

Multi-Step TD Learning

  • Flow chart
    [Backup diagram: S --(A_0, R_1)--> … --(A_{n−1}, R_n)--> S_n; V(S) is updated from the n rewards and V(S_n)]
  • Algorithm description

Input: the policy π to be evaluated
Initialize V(s) arbitrarily, ∀ s ∈ S
Parameters: step size α ∈ (0, 1], a positive integer n
All store and access operations (for S_t and R_t) can take their index mod n

Repeat (for each episode):
  Initialize and store S_0 ≠ terminal
  T ← ∞
  For t = 0, 1, 2, …:
   If t < T, then:
    Take an action according to π(·|S_t)
    Observe and store the next reward as R_{t+1} and the next state as S_{t+1}
    If S_{t+1} is terminal, then T ← t + 1
   τ ← t − n + 1  (τ is the time whose state's estimate is being updated)
   If τ ≥ 0:
    G ← Σ_{i=τ+1}^{min(τ+n, T)} γ^{i−τ−1} R_i
    If τ + n < T, then: G ← G + γⁿ V(S_{τ+n})     (G_τ^{(n)})
    V(S_τ) ← V(S_τ) + α[G − V(S_τ)]
  Until τ = T − 1

Understand this as: the update of V(S_0) uses the rewards R_1, …, R_n together with V(S_n) (and V(S_0)'s own old value); the update of V(S_1) uses R_2, …, R_{n+1} together with V(S_{n+1}); and so on.
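
A Python sketch of the same n-step TD procedure. To keep it readable it stores the whole trajectory instead of using the mod-n indexing trick; the env and policy interfaces are the same assumptions as in the TD(0) sketch.

from collections import defaultdict

def n_step_td_prediction(env, policy, n=4, num_episodes=1000, alpha=0.1, gamma=1.0):
    # n-step TD prediction of V. Stores the whole trajectory instead of indexing mod n.
    V = defaultdict(float)
    for _ in range(num_episodes):
        S = [env.reset()]                 # S[t]
        R = [0.0]                         # R[t]; R[0] is unused
        T = float('inf')
        t = 0
        while True:
            if t < T:
                S_next, reward, done, _ = env.step(policy(S[t]))
                S.append(S_next)
                R.append(reward)
                if done:
                    T = t + 1
            tau = t - n + 1               # time whose state estimate is being updated
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * R[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * V[S[tau + n]]       # bootstrap on V(S_{tau+n})
                V[S[tau]] += alpha * (G - V[S[tau]])
            if tau == T - 1:
                break
            t += 1
    return V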

On-Policy TD Learning of the Action Value q_π: Sarsa

One-Step TD Learning

  • Flow chart
    [Backup diagram: (S, A) --(R)--> (S', A'); Q(S, A) is updated from R and Q(S', A')]
  • Algorithm description

Initialize Q(s, a) arbitrarily, ∀ s ∈ S, ∀ a ∈ A(s), and Q(terminal, ·) = 0
Repeat (for each episode):
  Initialize S
  Choose A from S using the policy derived from Q (e.g. ε-greedy)
  Repeat (for each step of episode):
   Take action A, observe R, S'
   Choose A' from S' using the policy derived from Q (e.g. ε-greedy)
   Q(S, A) ← Q(S, A) + α[R + γ Q(S', A') − Q(S, A)]
   S ← S'; A ← A'
  Until S is terminal
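
A minimal Python sketch of one-step Sarsa with an ε-greedy policy. Again, env.reset(), env.step() and env.actions (the list of available actions) are assumed interfaces used only for illustration.

import random
from collections import defaultdict

def sarsa(env, num_episodes=1000, alpha=0.1, gamma=1.0, epsilon=0.1):
    # On-policy one-step Sarsa control.
    Q = defaultdict(float)                   # Q(s, a); Q(terminal, .) is never updated, so it stays 0

    def epsilon_greedy(S):
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(S, a)])

    for _ in range(num_episodes):
        S = env.reset()
        A = epsilon_greedy(S)
        done = False
        while not done:
            S_next, R, done, _ = env.step(A)
            if done:
                target = R                   # Q(terminal, .) = 0
            else:
                A_next = epsilon_greedy(S_next)
                target = R + gamma * Q[(S_next, A_next)]
            Q[(S, A)] += alpha * (target - Q[(S, A)])
            if not done:
                S, A = S_next, A_next        # S <- S'; A <- A'
    return Q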

Multi-Step TD Learning

  • Flow chart
    [Backup diagram: (S, A) --(R_1)--> … --(A_{n−1}, R_n)--> (S_n, A_n); Q(S, A) is updated from the n rewards and Q(S_n, A_n)]
  • Algorithm description

Initialize Q(s, a) arbitrarily, ∀ s ∈ S, ∀ a ∈ A
Initialize π to be ε-greedy with respect to Q, or to a fixed given policy
Parameters: step size α ∈ (0, 1],
  small ε > 0
  a positive integer n
All store and access operations (for S_t, A_t and R_t) can take their index mod n

Repeat (for each episode):
  Initialize and store S_0 ≠ terminal
  Select and store an action A_0 ~ π(·|S_0)
  T ← ∞
  For t = 0, 1, 2, …:
   If t < T, then:
    Take action A_t
    Observe and store the next reward as R_{t+1} and the next state as S_{t+1}
    If S_{t+1} is terminal, then:
     T ← t + 1
    Else:
     Select and store an action A_{t+1} ~ π(·|S_{t+1})
   τ ← t − n + 1  (τ is the time whose estimate is being updated)
   If τ ≥ 0:
    G ← Σ_{i=τ+1}^{min(τ+n, T)} γ^{i−τ−1} R_i
    If τ + n < T, then: G ← G + γⁿ Q(S_{τ+n}, A_{τ+n})     (G_τ^{(n)})
    Q(S_τ, A_τ) ← Q(S_τ, A_τ) + α[G − Q(S_τ, A_τ)]
    If π is being learned, then ensure that π(·|S_τ) is ε-greedy wrt Q
  Until τ = T − 1
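
A Python sketch of n-step Sarsa along the same lines, again storing the full trajectory instead of indexing mod n; env.reset(), env.step() and env.actions are assumed interfaces.

import random
from collections import defaultdict

def n_step_sarsa(env, n=4, num_episodes=1000, alpha=0.1, gamma=1.0, epsilon=0.1):
    # On-policy n-step Sarsa control.
    Q = defaultdict(float)

    def epsilon_greedy(S):
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(S, a)])

    for _ in range(num_episodes):
        S = [env.reset()]
        A = [epsilon_greedy(S[0])]
        R = [0.0]
        T = float('inf')
        t = 0
        while True:
            if t < T:
                S_next, reward, done, _ = env.step(A[t])
                S.append(S_next)
                R.append(reward)
                if done:
                    T = t + 1
                else:
                    A.append(epsilon_greedy(S_next))
            tau = t - n + 1
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * R[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * Q[(S[tau + n], A[tau + n])]   # bootstrap on Q(S_{tau+n}, A_{tau+n})
                Q[(S[tau], A[tau])] += alpha * (G - Q[(S[tau], A[tau])])
            if tau == T - 1:
                break
            t += 1
    return Q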

Off-Policy TD Learning of the Action Value q_π: Q-learning

The Q-learning algorithm (Watkins, 1989) was a breakthrough. It performs off-policy learning with the following update rule:

(3)  Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t)]

One-Step TD Learning

  • Algorithm description

Initialize Q(s, a) arbitrarily, ∀ s ∈ S, ∀ a ∈ A(s), and Q(terminal, ·) = 0
Repeat (for each episode):
  Initialize S
  Repeat (for each step of episode):
   Choose A from S using the policy derived from Q (e.g. ε-greedy)
   Take action A, observe R, S'
   Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S', a) − Q(S, A)]
   S ← S'
  Until S is terminal
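
A minimal Python sketch of one-step Q-learning. The behavior is ε-greedy while the update bootstraps on max_a Q(S', a), which is what makes it off-policy; the env interface is the same assumption as in the earlier sketches.

import random
from collections import defaultdict

def q_learning(env, num_episodes=1000, alpha=0.1, gamma=1.0, epsilon=0.1):
    # Off-policy one-step Q-learning control.
    Q = defaultdict(float)
    for _ in range(num_episodes):
        S = env.reset()
        done = False
        while not done:
            # behave epsilon-greedily ...
            if random.random() < epsilon:
                A = random.choice(env.actions)
            else:
                A = max(env.actions, key=lambda a: Q[(S, a)])
            S_next, R, done, _ = env.step(A)
            # ... but bootstrap on the greedy action: max_a Q(S', a)
            best_next = 0.0 if done else max(Q[(S_next, a)] for a in env.actions)
            Q[(S, A)] += alpha * (R + gamma * best_next - Q[(S, A)])
            S = S_next
    return Q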

  • Q-learning uses a max operation, which introduces a maximization bias problem.
    For a concrete illustration, see Example 6.7 in the book.
    Double Q-learning eliminates this problem.

Double Q-learning

One-Step TD Learning

Initialize Q_1(s, a) and Q_2(s, a) arbitrarily, ∀ s ∈ S, ∀ a ∈ A(s)
Initialize Q_1(terminal, ·) = Q_2(terminal, ·) = 0
Repeat (for each episode):
  Initialize S
  Repeat (for each step of episode):
   Choose A from S using the policy derived from Q_1 and Q_2 (e.g. ε-greedy in Q_1 + Q_2)
   Take action A, observe R, S'
   With 0.5 probability:
    Q_1(S, A) ← Q_1(S, A) + α[R + γ Q_2(S', argmax_a Q_1(S', a)) − Q_1(S, A)]
   Else:
    Q_2(S, A) ← Q_2(S, A) + α[R + γ Q_1(S', argmax_a Q_2(S', a)) − Q_2(S, A)]
   S ← S'
  Until S is terminal
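
A Python sketch of Double Q-learning along the same lines: on each step one randomly chosen table provides the argmax action and the other evaluates it, which is what removes the maximization bias. The env interface is the same assumption as above.

import random
from collections import defaultdict

def double_q_learning(env, num_episodes=1000, alpha=0.1, gamma=1.0, epsilon=0.1):
    Q1, Q2 = defaultdict(float), defaultdict(float)
    for _ in range(num_episodes):
        S = env.reset()
        done = False
        while not done:
            # behave epsilon-greedily with respect to Q1 + Q2
            if random.random() < epsilon:
                A = random.choice(env.actions)
            else:
                A = max(env.actions, key=lambda a: Q1[(S, a)] + Q2[(S, a)])
            S_next, R, done, _ = env.step(A)
            # pick which table to update; the other one evaluates the argmax action
            learn, evaluate = (Q1, Q2) if random.random() < 0.5 else (Q2, Q1)
            if done:
                target = R
            else:
                a_star = max(env.actions, key=lambda a: learn[(S_next, a)])   # argmax under the learning table
                target = R + gamma * evaluate[(S_next, a_star)]               # evaluated by the other table
            learn[(S, A)] += alpha * (target - learn[(S, A)])
            S = S_next
    return Q1, Q2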

Off-Policy TD Learning of the Action Value q_π (with importance sampling): Sarsa

Taking importance sampling into account, the ratio ρ is plugged into the Sarsa algorithm, turning it into an off-policy method.
ρ - importance sampling ratio

(4)  ρ = ∏_{i=τ+1}^{min(τ+n−1, T−1)} π(A_i|S_i) / μ(A_i|S_i)

Multi-Step TD Learning

  • Algorithm description

Input: a behavior policy μ such that μ(a|s) > 0, ∀ s ∈ S, ∀ a ∈ A
Initialize Q(s, a) arbitrarily, ∀ s ∈ S, ∀ a ∈ A
Initialize π to be ε-greedy with respect to Q, or to a fixed given policy
Parameters: step size α ∈ (0, 1],
  small ε > 0
  a positive integer n
All store and access operations (for S_t, A_t and R_t) can take their index mod n

Repeat (for each episode):
  Initialize and store S_0 ≠ terminal
  Select and store an action A_0 ~ μ(·|S_0)
  T ← ∞
  For t = 0, 1, 2, …:
   If t < T, then:
    Take action A_t
    Observe and store the next reward as R_{t+1} and the next state as S_{t+1}
    If S_{t+1} is terminal, then:
     T ← t + 1
    Else:
     Select and store an action A_{t+1} ~ μ(·|S_{t+1})
   τ ← t − n + 1  (τ is the time whose estimate is being updated)
   If τ ≥ 0:
    ρ ← ∏_{i=τ+1}^{min(τ+n−1, T−1)} π(A_i|S_i) / μ(A_i|S_i)
    G ← Σ_{i=τ+1}^{min(τ+n, T)} γ^{i−τ−1} R_i
    If τ + n < T, then: G ← G + γⁿ Q(S_{τ+n}, A_{τ+n})     (G_τ^{(n)})
    Q(S_τ, A_τ) ← Q(S_τ, A_τ) + α ρ [G − Q(S_τ, A_τ)]
    If π is being learned, then ensure that π(·|S_τ) is ε-greedy wrt Q
  Until τ = T − 1
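
Relative to the on-policy n-step Sarsa sketch above, the only changes are computing ρ from formula (4) and weighting the update by it. A sketch of that update step in Python, where pi_prob(a, s) and mu_prob(a, s) are hypothetical helpers returning π(a|s) and μ(a|s):

def off_policy_n_step_sarsa_update(Q, S, A, R, tau, n, T, alpha, gamma, pi_prob, mu_prob):
    # Importance-sampled n-step Sarsa update for (S[tau], A[tau]); S, A, R are the
    # stored trajectory lists from the n-step Sarsa sketch above.
    rho = 1.0
    for i in range(tau + 1, min(tau + n - 1, T - 1) + 1):        # formula (4)
        rho *= pi_prob(A[i], S[i]) / mu_prob(A[i], S[i])
    G = sum(gamma ** (i - tau - 1) * R[i]
            for i in range(tau + 1, min(tau + n, T) + 1))
    if tau + n < T:
        G += gamma ** n * Q[(S[tau + n], A[tau + n])]
    Q[(S[tau], A[tau])] += alpha * rho * (G - Q[(S[tau], A[tau])])   # update weighted by rho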

Expected Sarsa

  • Flow chart
    [Backup diagram: like n-step Sarsa, except the final node is Σ_a π(a|S_n)·Q(S_n, a) instead of Q(S_n, A_n)]
  • Algorithm description: omitted here; see the sketch below.
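
Since the algorithm description is omitted above, here is a sketch of the one-step Expected Sarsa update, the only place it differs from Sarsa: the target uses the expectation over next actions instead of the sampled Q(S', A'). pi_prob(a, s) is a hypothetical helper returning π(a|s).

def expected_sarsa_update(Q, S, A, R, S_next, done, actions, alpha, gamma, pi_prob):
    # One-step Expected Sarsa update: the target averages Q(S', a) over the policy's
    # action probabilities instead of using the sampled next action.
    expected = 0.0 if done else sum(pi_prob(a, S_next) * Q[(S_next, a)] for a in actions)
    Q[(S, A)] += alpha * (R + gamma * expected - Q[(S, A)])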

Off-Policy TD Learning of the Action Value q_π (without importance sampling): Tree Backup Algorithm

The idea of the Tree Backup algorithm is to back up the expected action value at every step.
Taking the expectation over action values means that every possible action a is evaluated, not just the one that was taken.

Multi-Step TD Learning

  • Flow chart
    [Backup diagram: every intermediate and final node is the expectation Σ_a π(a|·)·Q(·, a)]
  • Algorithm description

Initialize Q(s, a) arbitrarily, ∀ s ∈ S, ∀ a ∈ A
Initialize π to be ε-greedy with respect to Q, or to a fixed given policy
Parameters: step size α ∈ (0, 1],
  small ε > 0
  a positive integer n
All store and access operations can take their index mod n

Repeat (for each episode):
  Initialize and store S_0 ≠ terminal
  Select and store an action A_0 ~ π(·|S_0)
  Q_0 ← Q(S_0, A_0)
  T ← ∞
  For t = 0, 1, 2, …:
   If t < T, then:
    Take action A_t
    Observe and store the next reward as R_{t+1} and the next state as S_{t+1}
    If S_{t+1} is terminal, then:
     T ← t + 1
     δ_t ← R_{t+1} − Q_t
    Else:
     δ_t ← R_{t+1} + γ Σ_a π(a|S_{t+1}) Q(S_{t+1}, a) − Q_t
     Select arbitrarily and store an action as A_{t+1}
     Q_{t+1} ← Q(S_{t+1}, A_{t+1})
     π_{t+1} ← π(A_{t+1}|S_{t+1})
   τ ← t − n + 1  (τ is the time whose estimate is being updated)
   If τ ≥ 0:
    E ← 1
    G ← Q_τ
    For k = τ, …, min(τ + n − 1, T − 1):
     G ← G + E δ_k
     E ← γ E π_{k+1}
    Q(S_τ, A_τ) ← Q(S_τ, A_τ) + α[G − Q(S_τ, A_τ)]
    If π is being learned, then ensure that π(a|S_τ) is ε-greedy wrt Q(S_τ, ·)
  Until τ = T − 1
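
A Python sketch of the inner update of the pseudocode above. It assumes the per-step quantities δ_k, π_k (= π(A_k|S_k)) and Q_k have already been stored while stepping through the episode, exactly as the pseudocode does; the list names are invented for illustration.

def tree_backup_update(Q, S, A, Q_store, delta, pi_sel, tau, n, T, alpha, gamma):
    # n-step Tree Backup update for (S[tau], A[tau]). Q_store[k] = Q(S_k, A_k) at selection
    # time, delta[k] = delta_k and pi_sel[k] = pi(A_k|S_k) are assumed stored.
    G = Q_store[tau]                      # G <- Q_tau
    E = 1.0
    last = min(tau + n - 1, T - 1)
    for k in range(tau, last + 1):
        G += E * delta[k]                 # G <- G + E * delta_k
        if k < last:                      # E only weights the next delta, so skip the final update
            E *= gamma * pi_sel[k + 1]    # E <- gamma * E * pi_{k+1}
    Q[(S[tau], A[tau])] += alpha * (G - Q[(S[tau], A[tau])])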

Off-Policy TD Learning of the Action Value q_π: Q(σ)

Q(σ) unifies Sarsa (with importance sampling), Expected Sarsa, and the Tree Backup algorithm, and it accounts for importance sampling.
When σ = 1, the step uses the importance-sampled Sarsa backup.
When σ = 0, the step uses the Tree Backup expected-value backup.

Multi-Step TD Learning

  • Flow chart
    [Backup diagram: at each step, σ = 1 backs up through Q(·, ·) (Sarsa-style) and σ = 0 backs up through Σ_a π(a|·)·Q(·, a) (Tree-Backup-style); the last step bootstraps on S_n]
  • Algorithm description

Input: a behavior policy μ such that μ(a|s) > 0, ∀ s ∈ S, ∀ a ∈ A
Initialize Q(s, a) arbitrarily, ∀ s ∈ S, ∀ a ∈ A
Initialize π to be ε-greedy with respect to Q, or to a fixed given policy
Parameters: step size α ∈ (0, 1],
  small ε > 0
  a positive integer n
All store and access operations can take their index mod n

Repeat (for each episode):
  Initialize and store S_0 ≠ terminal
  Select and store an action A_0 ~ μ(·|S_0)
  Q_0 ← Q(S_0, A_0)
  T ← ∞
  For t = 0, 1, 2, …:
   If t < T, then:
    Take action A_t
    Observe and store the next reward as R_{t+1} and the next state as S_{t+1}
    If S_{t+1} is terminal, then:
     T ← t + 1
     δ_t ← R_{t+1} − Q_t
    Else:
     Select and store an action A_{t+1} ~ μ(·|S_{t+1})
     Select and store σ_{t+1}
     Q_{t+1} ← Q(S_{t+1}, A_{t+1})
     δ_t ← R_{t+1} + γ σ_{t+1} Q_{t+1} + γ (1 − σ_{t+1}) Σ_a π(a|S_{t+1}) Q(S_{t+1}, a) − Q_t
     π_{t+1} ← π(A_{t+1}|S_{t+1})
     ρ_{t+1} ← π(A_{t+1}|S_{t+1}) / μ(A_{t+1}|S_{t+1})
   τ ← t − n + 1  (τ is the time whose estimate is being updated)
   If τ ≥ 0:
    ρ ← 1
    E ← 1
    G ← Q_τ
    For k = τ, …, min(τ + n − 1, T − 1):
     G ← G + E δ_k
     E ← γ E [(1 − σ_{k+1}) π_{k+1} + σ_{k+1}]
     ρ ← ρ (1 − σ_k + σ_k ρ_k)
    Q(S_τ, A_τ) ← Q(S_τ, A_τ) + α ρ [G − Q(S_τ, A_τ)]
    If π is being learned, then ensure that π(a|S_τ) is ε-greedy wrt Q(S_τ, ·)
  Until τ = T − 1
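
A Python sketch of the inner Q(σ) update, mirroring the pseudocode above. It assumes δ_k, π_k (= π(A_k|S_k)), ρ_k (= π(A_k|S_k)/μ(A_k|S_k)) and σ_k have been stored while stepping through the episode; all list names are invented for illustration.

def q_sigma_update(Q, S, A, Q_store, delta, pi_sel, rho_sel, sigma, tau, n, T, alpha, gamma):
    # n-step Q(sigma) update for (S[tau], A[tau]). Q_store[k] = Q(S_k, A_k) at selection time;
    # delta, pi_sel, rho_sel and sigma are the stored per-step quantities from the pseudocode.
    rho, E = 1.0, 1.0
    G = Q_store[tau]                                                        # G <- Q_tau
    last = min(tau + n - 1, T - 1)
    for k in range(tau, last + 1):
        G += E * delta[k]                                                   # G <- G + E * delta_k
        if k < last:                                                        # E only weights the next delta
            E *= gamma * ((1 - sigma[k + 1]) * pi_sel[k + 1] + sigma[k + 1])
        rho *= 1 - sigma[k] + sigma[k] * rho_sel[k]                         # rho <- rho*(1 - sigma_k + sigma_k*rho_k)
    Q[(S[tau], A[tau])] += alpha * rho * (G - Q[(S[tau], A[tau])])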

Summary

A prerequisite of temporal-difference learning: reward information must be obtainable within the few steps being learned.
For example, can a reward be computed after every single move in chess? With the Monte Carlo method, which simulates until the game ends, a reward signal is certainly available.

References

  • Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, 2014/2015/2016 drafts
