Reinforcement Learning Reading Notes - 06~07 - Temporal-Difference Learning
Study notes for:
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, © 2014, 2015, 2016
If the mathematical notation is hard to follow, review the notation notes first.
Temporal-Difference Learning in Brief
Temporal-difference learning combines ideas from dynamic programming and Monte Carlo methods, and is a core idea of reinforcement learning.
The term "temporal difference" is not very intuitive. It may help to read it as "learning from the difference at the current time step" - that is, learning from the currently observed difference (the TD error).
The Monte Carlo method simulates (or experiences) an episode and, only after the episode ends, uses the returns observed along the episode to estimate the value of each state.
Temporal-difference learning also simulates (or experiences) an episode, but after every step (or every few steps) it uses the value of the newly reached state to update the estimate of the preceding state.
The Monte Carlo method can therefore be regarded as temporal-difference learning with the maximum possible number of steps.
Chapter 6 of the book considers only one-step temporal-difference learning; multi-step temporal-difference learning is explained in the next chapter.
Mathematical Formulation
From what we already know: if the policy's value (state values or action values) can be computed, the policy can be improved.
In the Monte Carlo method, computing the policy's value requires completing an episode and using the episode's observed return as the target for the state's value. The update is:
$V(S_t) \leftarrow V(S_t) + \alpha [G_t - V(S_t)]$, where $G_t$ is the actual return following time $t$.
The idea of temporal difference is to compute a state's value from the value of the next state, which gives a bootstrapping update known as TD(0):
$V(S_t) \leftarrow V(S_t) + \alpha [R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$
Note: the book points out that the Monte Carlo error can be written as a sum of TD errors; the identity is exact only if the value estimates are not updated during the episode, and otherwise holds only approximately. This is worth knowing but is not elaborated here.
Temporal-Difference Learning Methods
The book introduces the one-step methods in this chapter and the multi-step methods in the next chapter; both are covered in these notes:
- TD learning of the policy's state values (one-step / multi-step)
- On-policy TD learning of the policy's action values: Sarsa (one-step / multi-step)
- Off-policy TD learning of the policy's action values: Q-learning (one-step)
- Double Q-learning (one-step)
- Off-policy TD learning of the policy's action values (with importance sampling): Sarsa (multi-step)
- Off-policy TD learning of the policy's action values (without importance sampling): Tree Backup Algorithm (multi-step)
- Off-policy TD learning of the policy's action values: $Q(\sigma)$ (multi-step)
TD Learning of the Policy's State Values
One-step TD learning: TD(0)
- Flow diagram
- Algorithm description
Initialize $V(s)$ arbitrarily (e.g. $V(s) = 0, \forall s \in \mathcal{S}^+$)
Repeat (for each episode):
  Initialize $S$
  Repeat (for each step of episode):
    $A \leftarrow$ action given by $\pi$ for $S$
    Take action $A$, observe $R, S'$
    $V(S) \leftarrow V(S) + \alpha [R + \gamma V(S') - V(S)]$
    $S \leftarrow S'$
  Until $S$ is terminal
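The pseudocode above translates almost line by line into code. Below is a minimal tabular sketch in Python; the environment interface env.reset() -> state, env.step(action) -> (next_state, reward, done) and the callable policy(state) -> action are assumptions made for illustration, not interfaces from the book.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular TD(0) policy evaluation: V(S) <- V(S) + alpha*[R + gamma*V(S') - V(S)]."""
    V = defaultdict(float)                      # state-value estimates, default 0
    for _ in range(num_episodes):
        state = env.reset()                     # start a new episode
        done = False
        while not done:
            action = policy(state)              # A <- action given by pi for S
            next_state, reward, done = env.step(action)
            # bootstrap from V(S') unless S' is terminal (terminal value is 0)
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])   # one-step TD update
            state = next_state
    return V
```

The same env and policy conventions are reused in the sketches for the other algorithms below.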
Multi-step TD learning (n-step TD)
- Flow diagram
- Algorithm description
Input: the policy $\pi$ to be evaluated
Initialize $V(s)$ arbitrarily, $\forall s \in \mathcal{S}$
Parameters: step size $\alpha \in (0, 1]$, a positive integer $n$
All store and access operations (for $S_t$ and $R_t$) can take their index mod $n+1$
Repeat (for each episode):
  Initialize and store $S_0 \ne$ terminal
  $T \leftarrow \infty$
  For $t = 0, 1, 2, \dots$:
    If $t < T$, then:
      Take an action according to $\pi(\cdot|S_t)$
      Observe and store the next reward as $R_{t+1}$ and the next state as $S_{t+1}$
      If $S_{t+1}$ is terminal, then $T \leftarrow t + 1$
    $\tau \leftarrow t - n + 1$ ($\tau$ is the time whose state's estimate is being updated)
    If $\tau \ge 0$:
      $G \leftarrow \sum_{i=\tau+1}^{\min(\tau+n, T)} \gamma^{i-\tau-1} R_i$
      If $\tau + n < T$, then: $G \leftarrow G + \gamma^n V(S_{\tau+n})$
      $V(S_\tau) \leftarrow V(S_\tau) + \alpha [G - V(S_\tau)]$
  Until $\tau = T - 1$
The thing to understand here is that $G$ is computed from the stored rewards $R_{\tau+1}, \dots, R_{\min(\tau+n, T)}$, plus the bootstrapped value $\gamma^n V(S_{\tau+n})$ when the episode has not ended within $n$ steps; $V(S_\tau)$ is then updated toward $G$.
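As a sketch of how the n-step return $G$ can be computed in practice, here is a tabular n-step TD prediction routine in Python. For clarity it stores the whole episode in ordinary Python lists instead of the mod $n+1$ circular buffer used in the pseudocode; the env and policy interfaces are the same hypothetical ones assumed in the TD(0) sketch above.

```python
import math
from collections import defaultdict

def n_step_td_prediction(env, policy, n=4, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular n-step TD policy evaluation."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        states = [env.reset()]                  # states[t] = S_t
        rewards = [0.0]                         # rewards[t] = R_t (R_0 unused)
        T, t = math.inf, 0
        while True:
            if t < T:
                s_next, r, done = env.step(policy(states[t]))
                states.append(s_next)
                rewards.append(r)
                if done:
                    T = t + 1
            tau = t - n + 1                     # time whose estimate is being updated
            if tau >= 0:
                # G = sum of discounted rewards R_{tau+1} .. R_{min(tau+n, T)}
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * V[states[tau + n]]   # bootstrap from V(S_{tau+n})
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return V
```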
On-Policy TD Learning of the Policy's Action Values: Sarsa
One-step TD learning (Sarsa)
- Flow diagram
- Algorithm description
Initialize $Q(s, a)$ arbitrarily, $\forall s \in \mathcal{S}, a \in \mathcal{A}(s)$, and $Q(\text{terminal-state}, \cdot) = 0$
Repeat (for each episode):
  Initialize $S$
  Choose $A$ from $S$ using policy derived from $Q$ (e.g. $\epsilon$-greedy)
  Repeat (for each step of episode):
    Take action $A$, observe $R, S'$
    Choose $A'$ from $S'$ using policy derived from $Q$ (e.g. $\epsilon$-greedy)
    $Q(S, A) \leftarrow Q(S, A) + \alpha [R + \gamma Q(S', A') - Q(S, A)]$
    $S \leftarrow S'$; $A \leftarrow A'$
  Until $S$ is terminal
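A minimal tabular Sarsa sketch in Python, using the same hypothetical env interface as above plus an explicit list of actions; the epsilon_greedy helper is introduced here purely for illustration.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, num_episodes=1000, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Tabular one-step Sarsa (on-policy TD control)."""
    Q = defaultdict(float)                      # action-value estimates
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            if done:
                target = reward                 # Q(terminal, .) = 0
            else:
                next_action = epsilon_greedy(Q, next_state, actions, epsilon)
                target = reward + gamma * Q[(next_state, next_action)]
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            if not done:
                state, action = next_state, next_action
    return Q
```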
Multi-step TD learning (n-step Sarsa)
- Flow diagram
- Algorithm description
Initialize $Q(s, a)$ arbitrarily, $\forall s \in \mathcal{S}, a \in \mathcal{A}$
Initialize $\pi$ to be $\epsilon$-greedy with respect to $Q$, or to a fixed given policy
Parameters: step size $\alpha \in (0, 1]$, small $\epsilon > 0$, a positive integer $n$
All store and access operations (for $S_t$, $A_t$, and $R_t$) can take their index mod $n+1$
Repeat (for each episode):
  Initialize and store $S_0 \ne$ terminal
  Select and store an action $A_0 \sim \pi(\cdot|S_0)$
  $T \leftarrow \infty$
  For $t = 0, 1, 2, \dots$:
    If $t < T$, then:
      Take an action $A_t$
      Observe and store the next reward as $R_{t+1}$ and the next state as $S_{t+1}$
      If $S_{t+1}$ is terminal, then:
        $T \leftarrow t + 1$
      Else:
        Select and store an action $A_{t+1} \sim \pi(\cdot|S_{t+1})$
    $\tau \leftarrow t - n + 1$ ($\tau$ is the time whose estimate is being updated)
    If $\tau \ge 0$:
      $G \leftarrow \sum_{i=\tau+1}^{\min(\tau+n, T)} \gamma^{i-\tau-1} R_i$
      If $\tau + n < T$, then: $G \leftarrow G + \gamma^n Q(S_{\tau+n}, A_{\tau+n})$
      $Q(S_\tau, A_\tau) \leftarrow Q(S_\tau, A_\tau) + \alpha [G - Q(S_\tau, A_\tau)]$
      If $\pi$ is being learned, then ensure that $\pi(\cdot|S_\tau)$ is $\epsilon$-greedy wrt $Q$
  Until $\tau = T - 1$
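A tabular n-step Sarsa sketch in Python, again storing whole episodes in lists for clarity. The callable behavior(Q, state), which plays the role of the $\epsilon$-greedy $\pi$ in the pseudocode, is a hypothetical interface introduced for this sketch.

```python
import math
from collections import defaultdict

def n_step_sarsa(env, behavior, n=4, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular on-policy n-step Sarsa; behavior(Q, s) returns the action to take in s."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        S = [env.reset()]                       # S[t] = S_t
        A = [behavior(Q, S[0])]                 # A[t] = A_t
        R = [0.0]                               # R[t] = R_t (R_0 unused)
        T, t = math.inf, 0
        while True:
            if t < T:
                s_next, r, done = env.step(A[t])
                S.append(s_next)
                R.append(r)
                if done:
                    T = t + 1
                else:
                    A.append(behavior(Q, s_next))
            tau = t - n + 1
            if tau >= 0:
                G = sum(gamma ** (i - tau - 1) * R[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * Q[(S[tau + n], A[tau + n])]   # bootstrap from Q
                Q[(S[tau], A[tau])] += alpha * (G - Q[(S[tau], A[tau])])
            if tau == T - 1:
                break
            t += 1
    return Q
```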
Off-Policy TD Learning of the Policy's Action Values: Q-learning
Q-learning (Watkins, 1989) was a breakthrough algorithm. It performs off-policy learning with the update
$Q(S, A) \leftarrow Q(S, A) + \alpha [R + \gamma \max_a Q(S', a) - Q(S, A)]$
One-step TD learning (Q-learning)
- Algorithm description
Initialize $Q(s, a)$ arbitrarily, $\forall s \in \mathcal{S}, a \in \mathcal{A}(s)$, and $Q(\text{terminal-state}, \cdot) = 0$
Repeat (for each episode):
  Initialize $S$
  Repeat (for each step of episode):
    Choose $A$ from $S$ using policy derived from $Q$ (e.g. $\epsilon$-greedy)
    Take action $A$, observe $R, S'$
    $Q(S, A) \leftarrow Q(S, A) + \alpha [R + \gamma \max_a Q(S', a) - Q(S, A)]$
    $S \leftarrow S'$
  Until $S$ is terminal
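A tabular Q-learning sketch in Python (same hypothetical env and action-list conventions as the Sarsa sketch). The only change from Sarsa is that the bootstrap uses the greedy max over next actions rather than the action the behavior policy actually takes next, which is what makes the method off-policy.

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes=1000, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Tabular one-step Q-learning (off-policy TD control)."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # behave epsilon-greedily ...
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # ... but bootstrap from the greedy (max) next action: off-policy target
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```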
- Q-learning uses a max operator, which introduces a maximization bias problem.
For a concrete illustration, see Example 6.7 in the book.
Double Q-learning eliminates this problem.
Double Q-learning
One-step TD learning (Double Q-learning)
Initialize $Q_1(s, a)$ and $Q_2(s, a)$ arbitrarily, $\forall s \in \mathcal{S}, a \in \mathcal{A}(s)$
Initialize $Q_1(\text{terminal-state}, \cdot) = Q_2(\text{terminal-state}, \cdot) = 0$
Repeat (for each episode):
  Initialize $S$
  Repeat (for each step of episode):
    Choose $A$ from $S$ using policy derived from $Q_1$ and $Q_2$ (e.g. $\epsilon$-greedy in $Q_1 + Q_2$)
    Take action $A$, observe $R, S'$
    With 0.5 probability:
      $Q_1(S, A) \leftarrow Q_1(S, A) + \alpha [R + \gamma Q_2(S', \arg\max_a Q_1(S', a)) - Q_1(S, A)]$
    Else:
      $Q_2(S, A) \leftarrow Q_2(S, A) + \alpha [R + \gamma Q_1(S', \arg\max_a Q_2(S', a)) - Q_2(S, A)]$
    $S \leftarrow S'$
  Until $S$ is terminal
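A tabular Double Q-learning sketch in Python (same hypothetical interfaces as above). On each step one table selects the argmax action and the other table evaluates it, which is what removes the maximization bias.

```python
import random
from collections import defaultdict

def double_q_learning(env, actions, num_episodes=1000, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Tabular Double Q-learning with two action-value tables Q1 and Q2."""
    Q1, Q2 = defaultdict(float), defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # behave epsilon-greedily with respect to Q1 + Q2
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q1[(state, a)] + Q2[(state, a)])
            next_state, reward, done = env.step(action)
            # with probability 0.5 update Q1 (evaluated by Q2), otherwise the reverse
            select, evaluate = (Q1, Q2) if random.random() < 0.5 else (Q2, Q1)
            if done:
                target = reward
            else:
                a_star = max(actions, key=lambda a: select[(next_state, a)])
                target = reward + gamma * evaluate[(next_state, a_star)]
            select[(state, action)] += alpha * (target - select[(state, action)])
            state = next_state
    return Q1, Q2
```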
Off-Policy TD Learning of the Policy's Action Values (with importance sampling): Sarsa
Taking importance sampling into account, the importance sampling ratio $\rho$ is plugged into the Sarsa update, which turns it into an off-policy method.
- Importance sampling ratio: $\rho \leftarrow \prod_{i=\tau+1}^{\min(\tau+n, T-1)} \frac{\pi(A_i|S_i)}{\mu(A_i|S_i)}$
Multi-step TD learning (off-policy n-step Sarsa)
- Algorithm description
Input: behavior policy $\mu$ such that $\mu(a|s) > 0, \forall s \in \mathcal{S}, a \in \mathcal{A}$
Initialize $Q(s, a)$ arbitrarily, $\forall s \in \mathcal{S}, a \in \mathcal{A}$
Initialize $\pi$ to be $\epsilon$-greedy with respect to $Q$, or to a fixed given policy
Parameters: step size $\alpha \in (0, 1]$, small $\epsilon > 0$, a positive integer $n$
All store and access operations (for $S_t$, $A_t$, and $R_t$) can take their index mod $n+1$
Repeat (for each episode):
  Initialize and store $S_0 \ne$ terminal
  Select and store an action $A_0 \sim \mu(\cdot|S_0)$
  $T \leftarrow \infty$
  For $t = 0, 1, 2, \dots$:
    If $t < T$, then:
      Take an action $A_t$
      Observe and store the next reward as $R_{t+1}$ and the next state as $S_{t+1}$
      If $S_{t+1}$ is terminal, then:
        $T \leftarrow t + 1$
      Else:
        Select and store an action $A_{t+1} \sim \mu(\cdot|S_{t+1})$
    $\tau \leftarrow t - n + 1$ ($\tau$ is the time whose estimate is being updated)
    If $\tau \ge 0$:
      $\rho \leftarrow \prod_{i=\tau+1}^{\min(\tau+n, T-1)} \frac{\pi(A_i|S_i)}{\mu(A_i|S_i)}$
      $G \leftarrow \sum_{i=\tau+1}^{\min(\tau+n, T)} \gamma^{i-\tau-1} R_i$
      If $\tau + n < T$, then: $G \leftarrow G + \gamma^n Q(S_{\tau+n}, A_{\tau+n})$
      $Q(S_\tau, A_\tau) \leftarrow Q(S_\tau, A_\tau) + \alpha \rho [G - Q(S_\tau, A_\tau)]$
      If $\pi$ is being learned, then ensure that $\pi(\cdot|S_\tau)$ is $\epsilon$-greedy wrt $Q$
  Until $\tau = T - 1$
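A sketch of off-policy n-step Sarsa in Python. Actions are drawn from the behavior policy, and the update is weighted by the importance sampling ratio $\rho$. The callables mu_sample(state), mu_prob(state, action), and pi_prob(Q, state, action) (the target policy's probability, e.g. $\epsilon$-greedy in Q) are hypothetical interfaces assumed for this sketch.

```python
import math
from collections import defaultdict

def off_policy_n_step_sarsa(env, mu_sample, mu_prob, pi_prob, n=4,
                            num_episodes=1000, alpha=0.1, gamma=1.0):
    """Off-policy n-step Sarsa with importance sampling."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        S = [env.reset()]
        A = [mu_sample(S[0])]
        R = [0.0]                               # R[t] = R_t (R_0 unused)
        T, t = math.inf, 0
        while True:
            if t < T:
                s_next, r, done = env.step(A[t])
                S.append(s_next)
                R.append(r)
                if done:
                    T = t + 1
                else:
                    A.append(mu_sample(s_next))
            tau = t - n + 1
            if tau >= 0:
                # importance sampling ratio over the actions the update depends on
                rho = 1.0
                for i in range(tau + 1, min(tau + n, T - 1) + 1):
                    rho *= pi_prob(Q, S[i], A[i]) / mu_prob(S[i], A[i])
                G = sum(gamma ** (i - tau - 1) * R[i]
                        for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:
                    G += gamma ** n * Q[(S[tau + n], A[tau + n])]
                Q[(S[tau], A[tau])] += alpha * rho * (G - Q[(S[tau], A[tau])])
            if tau == T - 1:
                break
            t += 1
    return Q
```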
Expected Sarsa
Expected Sarsa replaces the sampled next action value in the Sarsa target with its expectation under the target policy:
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha [R_{t+1} + \gamma \sum_a \pi(a|S_{t+1}) Q(S_{t+1}, a) - Q(S_t, A_t)]$
- Flow diagram
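A sketch of a single Expected Sarsa update in Python, assuming an $\epsilon$-greedy target policy over an explicit action list (both are assumptions made for illustration):

```python
def expected_sarsa_update(Q, state, action, reward, next_state, actions,
                          alpha=0.1, gamma=1.0, epsilon=0.1, done=False):
    """One Expected Sarsa update: bootstrap from the expectation of Q(S', .)
    under an epsilon-greedy target policy instead of a sampled next action."""
    if done:
        expected_q = 0.0
    else:
        greedy = max(actions, key=lambda a: Q[(next_state, a)])
        expected_q = 0.0
        for a in actions:
            # epsilon-greedy probabilities: epsilon spread uniformly, the rest on the greedy action
            prob = epsilon / len(actions) + (1.0 - epsilon) * (a == greedy)
            expected_q += prob * Q[(next_state, a)]
    target = reward + gamma * expected_q
    Q[(state, action)] += alpha * (target - Q[(state, action)])
    return Q
```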
Off-Policy TD Learning of the Policy's Action Values (without importance sampling): Tree Backup Algorithm
The idea of the Tree Backup algorithm is to back up the expected action value at every step.
Taking the expectation of the action values means that every possible action at each state is evaluated (weighted by its probability under the target policy), not only the action actually taken, so no importance sampling ratio is needed.
Multi-step TD learning (n-step Tree Backup)
- Flow diagram
- Algorithm description
Initialize $Q(s, a)$ arbitrarily, $\forall s \in \mathcal{S}, a \in \mathcal{A}$
Initialize $\pi$ to be $\epsilon$-greedy with respect to $Q$, or to a fixed given policy
Parameters: step size $\alpha \in (0, 1]$, small $\epsilon > 0$, a positive integer $n$
All store and access operations (for $S_t$, $A_t$, and $R_t$) can take their index mod $n+1$
Repeat (for each episode):
  Initialize and store $S_0 \ne$ terminal
  Select arbitrarily and store an action $A_0$
  $T \leftarrow \infty$
  For $t = 0, 1, 2, \dots$:
    If $t < T$, then:
      Take an action $A_t$
      Observe and store the next reward as $R_{t+1}$ and the next state as $S_{t+1}$
      If $S_{t+1}$ is terminal, then:
        $T \leftarrow t + 1$
      Else:
        Select arbitrarily and store an action as $A_{t+1}$
    $\tau \leftarrow t - n + 1$ ($\tau$ is the time whose estimate is being updated)
    If $\tau \ge 0$:
      If $t + 1 \ge T$: $G \leftarrow R_T$
      Else: $G \leftarrow R_{t+1} + \gamma \sum_a \pi(a|S_{t+1}) Q(S_{t+1}, a)$
      For $k = \min(t, T-1)$ down to $\tau + 1$:
        $G \leftarrow R_k + \gamma \sum_{a \ne A_k} \pi(a|S_k) Q(S_k, a) + \gamma \pi(A_k|S_k) G$
      $Q(S_\tau, A_\tau) \leftarrow Q(S_\tau, A_\tau) + \alpha [G - Q(S_\tau, A_\tau)]$
      If $\pi$ is being learned, then ensure that $\pi(\cdot|S_\tau)$ is $\epsilon$-greedy wrt $Q$
  Until $\tau = T - 1$
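A sketch in Python of the backward recursion that builds the Tree Backup return $G$ for time $\tau$ from a stored episode segment; pi_prob(state, action) returning $\pi(a|s)$ is a hypothetical callable, and S, A, R are per-episode lists indexed by time as in the earlier sketches.

```python
def tree_backup_target(Q, S, A, R, actions, pi_prob, tau, t, T, gamma=1.0):
    """n-step Tree Backup return G for time tau.

    S[k], A[k], R[k] are the stored state, action and reward at time k;
    T is the episode termination time (float('inf') while the episode is running)."""
    if t + 1 >= T:
        G = R[T]                        # the episode has ended: start from the last reward
    else:
        # start from the expected action value of the leaf state under the target policy
        G = R[t + 1] + gamma * sum(pi_prob(S[t + 1], a) * Q[(S[t + 1], a)]
                                   for a in actions)
    # walk backwards from min(t, T-1) down to tau+1, expanding the "tree" at each level
    for k in range(min(t, T - 1), tau, -1):
        expected_rest = sum(pi_prob(S[k], a) * Q[(S[k], a)]
                            for a in actions if a != A[k])
        G = R[k] + gamma * expected_rest + gamma * pi_prob(S[k], A[k]) * G
    return G
```

The value $Q(S_\tau, A_\tau)$ is then updated toward this $G$ exactly as in the pseudocode above.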
Off-Policy TD Learning of the Policy's Action Values: $Q(\sigma)$
$Q(\sigma)$ unifies Sarsa (with importance sampling), Expected Sarsa, and the Tree Backup algorithm, and takes importance sampling into account.
When $\sigma = 1$, the step uses the importance-sampling Sarsa update (a sampled action).
When $\sigma = 0$, the step uses the Tree Backup expected-value update.
Multi-step TD learning (n-step $Q(\sigma)$)
- Flow diagram
- Algorithm description
Input: behavior policy $\mu$ such that $\mu(a|s) > 0, \forall s \in \mathcal{S}, a \in \mathcal{A}$
Initialize $Q(s, a)$ arbitrarily, $\forall s \in \mathcal{S}, \forall a \in \mathcal{A}$
Initialize $\pi$ to be $\epsilon$-greedy with respect to $Q$, or to a fixed given policy
Parameters: step size $\alpha \in (0, 1]$, small $\epsilon > 0$, a positive integer $n$
All store and access operations (for $S_t$, $A_t$, $R_t$, $\sigma_t$, and $\rho_t$) can take their index mod $n+1$
Repeat (for each episode):
  Initialize and store $S_0 \ne$ terminal
  Select and store an action $A_0 \sim \mu(\cdot|S_0)$
  $T \leftarrow \infty$
  For $t = 0, 1, 2, \dots$:
    If $t < T$, then:
      Take an action $A_t$
      Observe and store the next reward as $R_{t+1}$ and the next state as $S_{t+1}$
      If $S_{t+1}$ is terminal, then:
        $T \leftarrow t + 1$
      Else:
        Select and store an action $A_{t+1} \sim \mu(\cdot|S_{t+1})$
        Select and store $\sigma_{t+1}$; store $\frac{\pi(A_{t+1}|S_{t+1})}{\mu(A_{t+1}|S_{t+1})}$ as $\rho_{t+1}$
    $\tau \leftarrow t - n + 1$ ($\tau$ is the time whose estimate is being updated)
    If $\tau \ge 0$:
      $G \leftarrow 0$
      For $k = \min(t+1, T)$ down to $\tau + 1$:
        If $k = T$: $G \leftarrow R_T$
        Else:
          $\bar{V} \leftarrow \sum_a \pi(a|S_k) Q(S_k, a)$
          $G \leftarrow R_k + \gamma [\sigma_k \rho_k + (1 - \sigma_k) \pi(A_k|S_k)] (G - Q(S_k, A_k)) + \gamma \bar{V}$
      $Q(S_\tau, A_\tau) \leftarrow Q(S_\tau, A_\tau) + \alpha [G - Q(S_\tau, A_\tau)]$
      If $\pi$ is being learned, then ensure that $\pi(\cdot|S_\tau)$ is $\epsilon$-greedy wrt $Q$
  Until $\tau = T - 1$
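A sketch in Python of a single backward step of the $Q(\sigma)$ recursion used above; pi_prob(state, action) is a hypothetical callable returning $\pi(a|s)$, rho is $\pi(A_k|S_k)/\mu(A_k|S_k)$ for the sampled action, and G_next is the return already accumulated for the later time steps.

```python
def q_sigma_step(Q, state, action, reward, G_next, actions, pi_prob,
                 sigma, rho, gamma=1.0):
    """One backward step of the n-step Q(sigma) return.

    sigma = 1 gives the importance-sampling (Sarsa-like) branch;
    sigma = 0 gives the Tree Backup (expected-value) branch."""
    # expected action value of this state under the target policy
    v_bar = sum(pi_prob(state, a) * Q[(state, a)] for a in actions)
    # per-step weight: full IS correction when sampling, pi(A|S) when taking expectations
    weight = sigma * rho + (1.0 - sigma) * pi_prob(state, action)
    return reward + gamma * weight * (G_next - Q[(state, action)]) + gamma * v_bar
```

Iterating this step backwards from the end of the $n$-step segment down to $\tau + 1$ yields the target $G$ toward which $Q(S_\tau, A_\tau)$ is updated.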
Summary
A limitation of temporal-difference learning methods: reward information has to become available within the number of steps used for learning (within $n$ steps).
For example, can a meaningful reward be computed after every single move in chess? With a Monte Carlo method, which simulates all the way to the end of the game, a reward (the game's outcome) is certainly available.
References
Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction (2nd edition drafts, 2014-2016).
Watkins, C. J. C. H. Learning from Delayed Rewards. PhD thesis, Cambridge University, 1989.