Temporal-Difference Learning
TD learning occupies a central place in reinforcement learning; it combines ideas from dynamic programming (DP) and Monte Carlo (MC) methods. Like MC, TD can learn directly from raw experience without a complete model of the environment. Like DP, it does not have to wait for a final outcome before learning: it bootstraps, meaning each estimate is updated in part on the basis of other learned estimates.
The simplest TD method makes the update

V(S_t) <- V(S_t) + alpha * [R_{t+1} + gamma * V(S_{t+1}) - V(S_t)]

immediately on transitioning to S_{t+1} and receiving R_{t+1}. This method is called TD(0), or one-step TD.
# Tabular TD(0) for estimating v_pi
Input: the policy pi to be evaluated
Algorithm parameter: step size alpha in (0, 1]
Initialize V(s), for all s in S_plus, arbitrarily except that V(terminal) = 0
Loop for each episode:
    Initialize S
    Loop for each step of episode:
        A = action given by pi for S
        Take action A, observe R, S'
        V(S) = V(S) + alpha * [R + gamma * V(S') - V(S)]
        S = S'
        if S == terminal:
            break
TD error:
At each time step, the TD error measures the difference between the current estimate V(S_t) and the better, bootstrapped estimate R_{t+1} + gamma * V(S_{t+1}):

delta_t = R_{t+1} + gamma * V(S_{t+1}) - V(S_t)
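As a concrete sketch of TD(0) prediction and the TD error, the following Python snippet evaluates the equiprobable random policy on the 5-state random walk (states 1-5 non-terminal, 0 and 6 terminal, reward +1 only on exiting to the right). The environment setup, episode count, and step size are illustrative assumptions, not part of the original notes.

# TD(0) prediction on the 5-state random walk (illustrative sketch)
import random

N_STATES = 7                 # states 0..6; 0 and 6 are terminal
ALPHA, GAMMA = 0.1, 1.0
V = [0.0] * N_STATES         # V(terminal) stays 0

def step(s):
    # equiprobable random policy: move left or right; reward 1 only on the right exit
    s_next = s + random.choice([-1, 1])
    r = 1.0 if s_next == N_STATES - 1 else 0.0
    return r, s_next

for episode in range(10000):
    s = 3                                        # start in the middle state
    while s not in (0, N_STATES - 1):
        r, s_next = step(s)
        td_error = r + GAMMA * V[s_next] - V[s]  # delta_t
        V[s] += ALPHA * td_error                 # TD(0) update
        s = s_next

print([round(v, 2) for v in V[1:-1]])            # approaches 1/6, 2/6, ..., 5/6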
Advantages of TD Prediction Methods
Sarsa: On-policy TD Control
Sarsa takes its name from the quintuple (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}); the update rule expresses the relationship among these five elements:

Q(S_t, A_t) <- Q(S_t, A_t) + alpha * [R_{t+1} + gamma * Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]

and the corresponding TD error is delta_t = R_{t+1} + gamma * Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t).
# Sarsa (on-policy TD control) for estimating Q = q*
Algorithm parameters: step size alpha in (0, 1], small epsilon > 0
Initialize Q(s, a), for all s in S_plus, a in A(s), arbitrarily except that Q(terminal, .) = 0
Loop for each episode:
    Initialize S
    Choose A from S using policy derived from Q (e.g., epsilon-greedy)
    Loop for each step of episode:
        Take action A, observe R, S'
        Choose A' from S' using policy derived from Q (e.g., epsilon-greedy)
        Q(S, A) = Q(S, A) + alpha * [R + gamma * Q(S', A') - Q(S, A)]
        S = S'
        A = A'
        if S == terminal:
            break
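As a concrete sketch of the Sarsa pseudocode above, the following Python snippet runs epsilon-greedy Sarsa on a small corridor task: states 0-5, state 5 terminal, action 0 moves left (bounded at 0), action 1 moves right, and every step gives reward -1. The environment and hyperparameters are illustrative assumptions, not from the original notes.

# Sarsa on a small corridor task (illustrative sketch)
import random

N, TERMINAL = 6, 5
ALPHA, GAMMA, EPS = 0.5, 1.0, 0.1
Q = [[0.0, 0.0] for _ in range(N)]      # Q(terminal, .) stays 0

def eps_greedy(s):
    if random.random() < EPS:
        return random.randrange(2)
    return 0 if Q[s][0] > Q[s][1] else 1

def step(s, a):
    s_next = max(0, s - 1) if a == 0 else s + 1
    return -1.0, s_next

for episode in range(500):
    s = 0
    a = eps_greedy(s)
    while s != TERMINAL:
        r, s_next = step(s, a)
        a_next = eps_greedy(s_next)     # on-policy target uses the action actually chosen in S'
        Q[s][a] += ALPHA * (r + GAMMA * Q[s_next][a_next] - Q[s][a])
        s, a = s_next, a_next

print([round(max(q), 1) for q in Q])    # roughly -(5 - s) for each non-terminal state s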
Q-learning: Off-policy TD Control
# Q-learning (off-policy TD control) for estimating Q = q*
Algorithm parameters: step size alpha in (0, 1], small epsilon > 0
Initialize Q(s, a), for all s in S_plus, a in A(s), arbitrarily except that Q(terminal, .) = 0
Loop for each episode:
    Initialize S
    Loop for each step of episode:
        Choose A from S using policy derived from Q (e.g., epsilon-greedy)
        Take action A, observe R, S'
        Q(S, A) = Q(S, A) + alpha * [R + gamma * max_a Q(S', a) - Q(S, A)]
        S = S'
        if S == terminal:
            break
Q-learning directly approximates q*, the optimal action-value function, independently of the behavior policy being followed.
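For comparison with Sarsa, here is a Q-learning sketch on the same style of corridor task; the only substantive change is the max over next-state action values in the update target. Again, the environment and hyperparameters are illustrative assumptions.

# Q-learning on the corridor task (illustrative sketch)
import random

N, TERMINAL = 6, 5
ALPHA, GAMMA, EPS = 0.5, 1.0, 0.1
Q = [[0.0, 0.0] for _ in range(N)]      # Q(terminal, .) stays 0

def eps_greedy(s):
    if random.random() < EPS:
        return random.randrange(2)
    return 0 if Q[s][0] > Q[s][1] else 1

def step(s, a):
    return -1.0, (max(0, s - 1) if a == 0 else s + 1)

for episode in range(500):
    s = 0
    while s != TERMINAL:
        a = eps_greedy(s)               # behavior policy: epsilon-greedy
        r, s_next = step(s, a)
        # off-policy target: greedy (max) over actions in S', independent of
        # the action the behavior policy will actually take next
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s_next]) - Q[s][a])
        s = s_next

print([round(max(q), 1) for q in Q])    # approaches -(5 - s): the optimal values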
Expected Sarsa
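Following Sutton and Barto, Expected Sarsa replaces the sampled Q(S', A') in the Sarsa target with the expected value of Q(S', .) under the current policy:

Q(S_t, A_t) <- Q(S_t, A_t) + alpha * [R_{t+1} + gamma * sum_a pi(a | S_{t+1}) * Q(S_{t+1}, a) - Q(S_t, A_t)]

Below is a minimal Python sketch of that target for an epsilon-greedy policy; the helper name and arguments are hypothetical, not from the original notes.

# Expected Sarsa target under an epsilon-greedy policy (hypothetical helper)
def expected_sarsa_target(r, q_next, eps, gamma):
    # r + gamma * E_pi[Q(S', .)] where pi is epsilon-greedy with respect to q_next
    n = len(q_next)
    greedy = max(range(n), key=lambda a: q_next[a])
    probs = [eps / n + (1.0 - eps if a == greedy else 0.0) for a in range(n)]
    return r + gamma * sum(p * q for p, q in zip(probs, q_next))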
Double Q-learning
# Double Q-learning, for estimating Q1 = Q2 = q*
Algorithm parameters: step size alpha in (0, 1], small epsilon > 0
Initialize Q1(s, a) and Q2(s, a), for all s in S_plus, a in A(s), such that Q(terminal, .) = 0
Loop for each episode:
    Initialize S
    Loop for each step of episode:
        Choose A from S using the policy epsilon-greedy in Q1 + Q2
        Take action A, observe R, S'
        with 0.5 probability:
            Q1(S, A) = Q1(S, A) + alpha * [R + gamma * Q2(S', arg max_a Q1(S', a)) - Q1(S, A)]
        else:
            Q2(S, A) = Q2(S, A) + alpha * [R + gamma * Q1(S', arg max_a Q2(S', a)) - Q2(S, A)]
        S = S'
        if S == terminal:
            break
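A Python sketch matching the Double Q-learning pseudocode above, again on the illustrative corridor task (states 0-5, state 5 terminal, actions 0 = left and 1 = right, reward -1 per step); the environment and hyperparameters are assumptions, not part of the original notes.

# Double Q-learning on the corridor task (illustrative sketch)
import random

N, TERMINAL = 6, 5
ALPHA, GAMMA, EPS = 0.5, 1.0, 0.1
Q1 = [[0.0, 0.0] for _ in range(N)]
Q2 = [[0.0, 0.0] for _ in range(N)]

def argmax(values):
    return max(range(len(values)), key=lambda a: values[a])

def eps_greedy(s):
    # behavior policy: epsilon-greedy in Q1 + Q2
    if random.random() < EPS:
        return random.randrange(2)
    return argmax([Q1[s][a] + Q2[s][a] for a in range(2)])

def step(s, a):
    return -1.0, (max(0, s - 1) if a == 0 else s + 1)

for episode in range(1000):
    s = 0
    while s != TERMINAL:
        a = eps_greedy(s)
        r, s_next = step(s, a)
        if random.random() < 0.5:
            a_star = argmax(Q1[s_next])      # Q1 selects the action, Q2 evaluates it
            Q1[s][a] += ALPHA * (r + GAMMA * Q2[s_next][a_star] - Q1[s][a])
        else:
            a_star = argmax(Q2[s_next])      # Q2 selects the action, Q1 evaluates it
            Q2[s][a] += ALPHA * (r + GAMMA * Q1[s_next][a_star] - Q2[s][a])
        s = s_next

print([round(max((x + y) / 2 for x, y in zip(q1, q2)), 1) for q1, q2 in zip(Q1, Q2)])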