Post archive for February 27, 2024 - MoonOut

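Summary: Applies offline hindsight information matching (HIM) to preference-based RL (PbRL): ① train a policy a=π(s,z) on offline trajectories; ② train the optimal hindsight z* to be close to z+ and far from z-.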
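Step ② can be illustrated with a small sketch. The summary does not say which objective is used, so this swaps in a standard triplet-style hinge loss as an assumption: it penalizes z* when it is not closer to the preferred embedding z+ than to the rejected embedding z- by a margin. The names `sq_dist`, `triplet_loss`, `update_z_star`, and the `margin` and `lr` parameters are hypothetical and chosen only for illustration.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(z_star, z_pos, z_neg, margin=1.0):
    """Hinge loss that is zero once z_star is closer to z_pos than to z_neg by `margin`.

    Assumption: a stand-in objective, not necessarily the one used in the post.
    """
    return max(0.0, sq_dist(z_star, z_pos) - sq_dist(z_star, z_neg) + margin)


def update_z_star(z_star, z_pos, z_neg, lr=0.1, margin=1.0):
    """One gradient-descent step on z_star for the triplet loss above."""
    if triplet_loss(z_star, z_pos, z_neg, margin) == 0.0:
        return list(z_star)  # hinge inactive: gradient is zero
    # While the hinge is active, d(loss)/d(z_star) = 2 * (z_neg - z_pos).
    return [z - lr * 2.0 * (n - p) for z, p, n in zip(z_star, z_pos, z_neg)]


z_star = [0.0, 0.0]
z_pos, z_neg = [1.0, 0.0], [-1.0, 0.0]
z_new = update_z_star(z_star, z_pos, z_neg)
```

After one step, `z_new` has moved toward `z_pos` and away from `z_neg`, so the loss is lower than at the starting point.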

posted @ 2024-02-27 21:38 MoonOut Views (65) Comments (0) Recommendations (0)

Summary: In offline training trajectories, treat what happens after the current timestep as hindsight, and train the policy to output the action that leads to that hindsight.
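The relabeling idea in this summary can be sketched as follows. For each timestep t of an offline trajectory, the future after t is summarized into a hindsight value z, and (s_t, z, a_t) becomes a supervised training example for a policy conditioned on z. The function name `relabel_with_hindsight` and the `encode_future` callback are hypothetical. In practice the encoder would likely be learned, and here it is just a plain function supplied by the caller.

```python
def relabel_with_hindsight(states, actions, encode_future):
    """Build (state, hindsight, action) examples from one offline trajectory.

    `encode_future` is a hypothetical stand-in that maps the list of states
    after timestep t to a hindsight value z.
    """
    examples = []
    for t in range(len(actions)):
        future = states[t + 1:]          # everything that happened after step t
        z = encode_future(future)        # hindsight summary of that future
        examples.append((states[t], z, actions[t]))
    return examples


# Usage: take the final state reached as the hindsight (an illustrative choice).
states = [0, 1, 2, 3]
actions = ["a", "b", "c"]
examples = relabel_with_hindsight(
    states, actions, lambda future: future[-1] if future else None
)
```

With this choice of encoder, every example is conditioned on the state the trajectory actually ended in, so the policy learns which action was taken on the way to that outcome.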

posted @ 2024-02-27 21:08 MoonOut Views (187) Comments (0) Recommendations (0)

Summary: ① the sequence is {s, a, R, s, ...}; ② an MLP on top of the decoded output at s predicts the action; ③ the return-to-go is given as a form of hindsight.
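The return-to-go in ③ is the sum of rewards from the current timestep to the end of the trajectory. A minimal sketch of how it could be computed from a list of per-step rewards, assuming no discounting, with `returns_to_go` as an illustrative name:

```python
def returns_to_go(rewards):
    """Return R_t = r_t + r_{t+1} + ... + r_T for every timestep t (undiscounted)."""
    rtg = []
    running = 0.0
    for r in reversed(rewards):   # accumulate from the end of the trajectory
        running += r
        rtg.append(running)
    rtg.reverse()                 # restore chronological order
    return rtg


rtg = returns_to_go([1.0, 0.0, 2.0])  # [3.0, 2.0, 2.0]
```

At training time these values come from the offline data. At test time a target return would be supplied instead and reduced by each observed reward.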

posted @ 2024-02-27 20:14 MoonOut Views (603) Comments (0) Recommendations (2)

The moon rises, and the rosy clouds return 🌙