offline RL | HIM: hindsight-based RL is a big family of ideas

Paper: Generalized Decision Transformer for Offline Hindsight Information Matching, ICLR 2022, review scores 6 / 8 / 8, spotlight. One of the 8s was raised from a 5 after the rebuttal; the rebuttal also appears to have raised the other reviewers' scores substantially.
PDF: https://arxiv.org/pdf/2111.10364.pdf
HTML: https://ar5iv.labs.arxiv.org/html/2111.10364
OpenReview: https://openreview.net/forum?id=CAjxVodl_v
GitHub: https://github.com/frt03/generalized_dt

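Main points

(I did not fully understand the paper, but the idea feels exciting...)
Contribution 1: many offline / online RL methods appear to fit into a single algorithmic framework, HIM (hindsight information matching).
- HIM: train a conditional policy \(a=\pi(s,z)\), where z is some conditioning information. Concretely, z can describe what happens after the current timestep in a training trajectory (hindsight). We feed z and the actions of that trajectory to the model, so that it learns actions that achieve an arbitrary hindsight, in the same spirit as HER.
- z can take different forms: a state that the trajectory will reach (HER), the return-to-go of the trajectory (DT), or some more mysterious latent that has to be learned.
Contribution 2: a general Decision Transformer-based method for solving HIM problems (Generalized Decision Transformer, GDT), instantiated as two algorithms, CDT and BDT.
- CDT: Categorical DT. It appears to use some state or reward features as the hindsight and to discretize the hindsight space, so that learning becomes categorical and can use a cross-entropy loss.
- BDT: Bi-directional DT, a somewhat mysterious method for one-shot offline imitation learning. It uses two transformers: one produces the policy \(a=\pi(s,z)\), and the other learns from the state sequence in reverse order. (I did not fully follow this part.)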

Outline
0 abstract
1 Two related problem definitions
- 1.1 State Marginal Matching
- 1.2 Parameterized RL Objectives
2 Hindsight Information Matching
3 Generalized Decision Transformer
4 experiments

0 abstract

How to extract as much learning signal from each trajectory data has been a key problem in reinforcement learning (RL), where sample inefficiency has posed serious challenges for practical applications. Recent works have shown that using expressive policy function approximators and conditioning on future trajectory information -- such as future states in hindsight experience replay or returns-to-go in Decision Transformer (DT) -- enables efficient learning of multi-task policies, where at times online RL is fully replaced by offline behavioral cloning, e.g. sequence modeling. We demonstrate that all these approaches are doing hindsight information matching (HIM) -- training policies that can output the rest of trajectory that matches some statistics of future state information. We present Generalized Decision Transformer (GDT) for solving any HIM problem, and show how different choices for the feature function and the anti-causal aggregator not only recover DT as a special case, but also lead to novel Categorical DT (CDT) and Bi-directional DT (BDT) for matching different statistics of the future. For evaluating CDT and BDT, we define offline multi-task state-marginal matching (SMM) and imitation learning (IL) as two generic HIM problems, propose a Wasserstein distance loss as a metric for both, and empirically study them on MuJoCo continuous control benchmarks. CDT, which simply replaces anti-causal summation with anti-causal binning in DT, enables the first effective offline multi-task SMM algorithm that generalizes well to unseen and even synthetic multi-modal state-feature distributions. BDT, which uses an anti-causal second transformer as the aggregator, can learn to model any statistics of the future and outperforms DT variants in offline multi-task IL. Our generalized formulations from HIM and GDT greatly expand the role of powerful sequence modeling architectures in modern RL.

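background:
- How to extract as much learning signal as possible from each trajectory has long been a key problem in RL, and the low sample efficiency of RL poses serious challenges for practical applications.
- Recent work shows that using expressive policy function approximators and conditioning on future trajectory information (for example, future states in hindsight experience replay (HER), or returns-to-go in Decision Transformer) enables efficient learning of multi-task policies.
- In some of these cases, online RL is fully replaced by offline behavioral cloning, e.g. sequence modeling.
method:
- The paper shows that all of these approaches are doing hindsight information matching (HIM): training a policy that can output the rest of a trajectory such that it matches some statistics of future state information.
- It proposes Generalized Decision Transformer (GDT) for solving any HIM problem.
- With different choices of the feature function and the anti-causal aggregator, GDT recovers DT as a special case and also yields the new Categorical DT (CDT) and Bi-directional DT (BDT), which match different statistics of the future.
experiment:
- To evaluate CDT and BDT, the paper defines offline multi-task state-marginal matching (SMM) and imitation learning (IL) as two generic HIM problems, uses a Wasserstein distance loss as the metric for both, and runs experiments on MuJoCo continuous control benchmarks.
- CDT is very simple: it replaces the anti-causal summation in DT with anti-causal binning, and gives the first effective offline multi-task SMM algorithm, which generalizes well to unseen and even synthetic multi-modal reward or state-feature distributions.
- BDT uses an anti-causal second transformer as the aggregator, can learn to model any statistics of the future, and outperforms DT variants in offline multi-task IL (one-shot IL).
The generalized formulations derived from HIM and GDT greatly expand the role of powerful sequence modeling architectures in modern RL.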

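1 Two related problem definitions

1.1 State Marginal Matching

Standard RL objective: \(L_{RL}(\pi)=\frac{1}{1-\gamma}\mathbb E_{s\sim \rho^{\pi}(s),a\sim\pi(\cdot|s)}[r(s,a)]\). (Equation 1)
SMM: instead of maximizing a stationary reward, the goal is to find a policy that minimizes the divergence between its state marginal distribution \(\rho^{\pi}(s)\) and a given target distribution \(\rho^*(s)\): \(L_{SMM}(\pi)=D(\rho^{\pi}(s),\rho^*(s))\), where D is some divergence. (Equation 2)
At its core, SMM is closely related to imitation learning.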

1.2 Parameterized RL Objectives

Given a reward function parameterized by z, learn a conditional policy \(\pi(a|s,z)\).
z follows a distribution p(z).
Equation 1 then becomes: \(L_{RL}(\pi)=\mathbb E_z[L_{RL}(\pi,z)]=\frac{1}{1-\gamma}\mathbb E_{z\sim p(z),s\sim \rho^{\pi_z}(s),a\sim\pi(\cdot|s,z)}[r_z(s,a)]\). (Equation 3)
(For example, as in HER, changing the conditioning goal gives a different reward function: under one goal, scoring gives reward 1; under another, hitting the goalpost gives reward 1.)
In the rest of these notes, z takes the form of information statistics.
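To make the football example concrete, here is a minimal sketch of a z-parameterized reward. The function name `reward_z`, the tolerance `tol`, the ball positions, and the state-only form of the reward are all illustrative assumptions, not taken from the paper.

```python
import numpy as np

def reward_z(state, z, tol=0.1):
    """Hypothetical sparse reward r_z(s): 1 if the state is within tol of goal z, else 0."""
    return float(np.linalg.norm(np.asarray(state) - np.asarray(z)) <= tol)

# Two different goals z induce two different reward functions over the same states.
goal_mouth = np.array([10.0, 0.0])   # assumed position of the goal mouth
goal_post = np.array([10.0, 3.0])    # assumed position of the goalpost

ball = np.array([10.0, 3.0])         # the ball ended up at the post
r_mouth = reward_z(ball, goal_mouth)
r_post = reward_z(ball, goal_post)
```

The same final ball position earns no reward under the "score" goal but full reward under the "hit the post" goal, which is the sense in which z selects the reward function.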

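2 Hindsight Information Matching

(This is the method section.)

Defining information statistics:
- Information statistics: given a trajectory starting from \(s_t\), \(\tau_t=\{s_t,a_t,s_{t+1},\cdots\}\) (possibly in another form, such as DT's {s, a, R, s, ...}?), define its information statistics \(I(\tau_t)\), a function of the trajectory.
- Feature function: define a feature function \(\Phi(s_t,a_t)\in F\), so that the trajectory can be written as \(\tau_t^\Phi=\{\phi_t=\Phi(s_t,a_t),\phi_{t+1},\cdots\}\). Φ can be the identity function, the reward function, or a subset of the state-action dimensions.
- The feature function seems to act as dimensionality reduction on the trajectory: the information statistics are then computed on the reduced trajectory features, written \(I^\Phi(\tau)\) (?)
- In some works the information statistics are a learned network. (Like DT converting each modality (s, a, R) into tokens separately?)
Defining the information matching problem:
- Learn a conditional policy \(\pi(a|s,z)\) such that the information statistics of its trajectories are as close as possible to the given information statistics z.
- Objective: \(\min_\pi\mathbb E_{z\sim p(z),\tau\sim\rho_z^\pi(\tau)}[D(I^\Phi(\tau),z)]\). (Equation 4)
- z may follow some distribution p(z) determined by the environment. (The form of z seems to be up to us. For HER, for example, if the goal z is set to hitting the goalpost, then trajectories generated under this z should all hit the goalpost. What HER ultimately wants, though, is that when z is set to the goal mouth, every shot scores; in that case the distribution of z does not seem very IID.)
Training:
- For any trajectory τ, setting \(z^*=I^\Phi(\tau)\) gives divergence D = 0 directly. So τ itself can serve as training data for \(\pi_{z^*}\), as in HER (learning a hit-the-goalpost policy).
- Different choices of the information statistics \(I(\tau_t)\) correspond to different algorithms (Table 1 of the paper).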
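The relabeling step \(z^*=I^\Phi(\tau)\) can be sketched for two choices of information statistics. This is a toy illustration under assumptions: the helper names are hypothetical, and taking the final state as the HER goal is only one possible relabeling choice.

```python
import numpy as np

def returns_to_go(rewards):
    """Anti-causal summation (DT-style): z_t is the sum of rewards from step t to the end."""
    return np.cumsum(rewards[::-1])[::-1]

def final_state_goal(states):
    """HER-style statistic (assumed 'final state' variant): z_t is the last state reached."""
    return np.repeat(states[-1:], len(states), axis=0)

def relabel(states, actions, statistic):
    """Build (s_t, z*_t, a_t) tuples, with z* computed from the trajectory itself."""
    return list(zip(states, statistic, actions))

rewards = np.array([1.0, 0.0, 2.0])
states = np.array([[0.0], [1.0], [3.0]])
actions = np.array([0, 1, 1])

rtg = returns_to_go(rewards)          # conditioning signal for a DT-like policy
goals = final_state_goal(states)      # conditioning signal for a HER-like policy
dataset = relabel(states, actions, rtg)
```

Because z* is computed from the same trajectory that supplies the actions, every tuple is a valid supervised example for \(\pi(a|s,z^*)\), whatever the quality of the trajectory.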

3 Generalized Decision Transformer

Problems that CDT and BDT are meant to solve: offline multi-task state-marginal matching (SMM) and offline multi-task imitation learning (IL).

Pseudocode for each algorithm: Algorithm 1 in Appendix F, on the last page of the paper.

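3.1 CDT: Categorical Decision Transformer

Used for offline SMM.
Assume a low-dimensional feature function Φ, such as the reward or a few state dimensions (e.g. xyz velocities), and discretize the feature space so that the hindsight becomes categorical.
Training seems similar to DT: take R or the next state s' as the hindsight, and train the model to output the action a given the current state s.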
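As a rough picture of "anti-causal binning", the sketch below computes, for every step t, the normalized histogram of a scalar feature over steps t to the end. The function name, the bin edges, and the toy feature values are assumptions for illustration; the paper's actual implementation may differ.

```python
import numpy as np

def anti_causal_binning(features, bin_edges):
    """For each step t, return the normalized histogram of features[t:]."""
    n_bins = len(bin_edges) - 1
    # Map each feature to a bin index; clip so out-of-range values land in the edge bins.
    idx = np.clip(np.digitize(features, bin_edges) - 1, 0, n_bins - 1)
    one_hot = np.eye(n_bins)[idx]                        # shape (T, n_bins)
    # Reverse cumulative sum: counts over steps t..T-1 instead of 0..t.
    counts = np.cumsum(one_hot[::-1], axis=0)[::-1]
    return counts / counts.sum(axis=1, keepdims=True)

features = np.array([0.1, 0.9, 0.8, 0.2])   # e.g. a per-step velocity (toy values)
edges = np.array([0.0, 0.5, 1.0])           # two bins: "slow" and "fast"
z = anti_causal_binning(features, edges)
```

Compared with DT, the only change in this sketch is the aggregator: a reverse cumulative sum of one-hot bin memberships replaces the reverse cumulative sum of rewards, so z_t is a categorical distribution rather than a scalar.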

3.2 Decision Transformer with Learned Φ

(I did not understand this part.) The objective resembles adversarial inverse RL (learning \(a_t\) given \(s_t,s_{t+1}\)).

3.3 BDT: Bi-directional Decision Transformer for One-Shot Imitation Learning

A bi-directional DT for one-shot imitation learning.
Here the hindsight z is a target state sequence.
There is a second DT that is anti-causal and takes the target state sequence in reverse order as input. (I did not fully understand this.)
At evaluation time, it seems that the state sequence traversed so far is used as the hindsight z.
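One way to read "anti-causal, fed in reverse order": running a causal aggregator on the reversed sequence and reversing the output gives a summary z_t that depends only on steps t and later. The sketch below uses a uniform-weight running mean as a stand-in for the learned second transformer; that substitution and all names are assumptions, and this is not the BDT architecture itself.

```python
import numpy as np

def causal_mean(x):
    """Causal aggregator stand-in: output t is the mean of x[0..t]."""
    T = len(x)
    mask = np.tril(np.ones((T, T)))                   # position t attends to positions <= t
    weights = mask / mask.sum(axis=1, keepdims=True)  # uniform attention over visible positions
    return weights @ x

def anti_causal_mean(x):
    """Anti-causal aggregator: reverse the input, apply the causal one, reverse the output."""
    return causal_mean(x[::-1])[::-1]

target_states = np.array([[0.0], [2.0], [4.0]])   # toy target state sequence
z = anti_causal_mean(target_states)               # z_t summarizes target_states[t:]
```

In this toy version z_0 averages the whole target sequence while the last z only sees the final state, which matches the intuition that the hindsight at step t should describe what remains to be imitated.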

4 experiments

Main questions: (the question list of the experiments section, a format familiar from Pieter Abbeel's papers)

(SMM) Can CDT match unseen reward distributions?
(SMM) Can CDT match and generalize to unseen 1D/2D state-feature distributions?
(SMM) Can CDT match unseen synthesized state-feature distributions?
(IL) Can BDT perform offline one-shot imitation learning in full state?

Because hindsight has a supervised-learning flavor, at evaluation time we provide a hindsight and check how well the policy reproduces it.

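Setup: using the D4RL medium-expert datasets, sort all trajectories by cumulative reward, hold out the five best trajectories and five 50th-percentile trajectories as the test set (10 trajectories in total), and use the rest as the training set.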
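The split described above can be sketched as follows. How exactly the five "50th-percentile" trajectories are picked is not stated in these notes, so the centered window around the median used here is an assumption, as are the function name and the toy returns.

```python
import numpy as np

def split_by_return(returns, n_best=5, n_median=5):
    """Hold out the top n_best trajectories and n_median trajectories around the median."""
    order = np.argsort(returns)                 # trajectory indices, ascending return
    best = order[-n_best:]
    start = len(order) // 2 - n_median // 2     # assumed: window centered on the median
    median = order[start:start + n_median]
    test = np.concatenate([best, median])
    train = np.setdiff1d(order, test)           # everything not held out
    return train, test

returns = np.arange(100, dtype=float)           # toy returns: trajectory i has return i
train_idx, test_idx = split_by_return(returns)
```

The sketch assumes the dataset is large enough that the median window and the top-return window do not overlap.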
Evaluating CDT / BDT: the Wasserstein-1 distance between the categorical feature distribution of the rollout trajectory and that of the target trajectory.
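For one-dimensional categorical distributions over equally spaced bins, the Wasserstein-1 distance reduces to the summed absolute difference of the two CDFs times the bin width. A minimal sketch, where the function name and the toy histograms are illustrative assumptions:

```python
import numpy as np

def wasserstein1_categorical(p, q, bin_width=1.0):
    """W1 between two histograms on the same equally spaced 1D bins."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p = p / p.sum()                               # normalize counts to probabilities
    q = q / q.sum()
    cdf_gap = np.abs(np.cumsum(p) - np.cumsum(q))
    return float(cdf_gap.sum() * bin_width)

rollout = [1, 0, 0]    # all mass in the first bin
target = [0, 0, 1]     # all mass in the last bin
d = wasserstein1_categorical(rollout, target)
d_same = wasserstein1_categorical(target, target)
```

Unlike a KL-style divergence, this distance stays finite when the two histograms have disjoint support and grows with how far the mass has to move, which suits comparing binned state features.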

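Open questions:

In SMM with the synthesized bi-modal distribution (apparently HalfCheetah front flips + back flips), did the policy only learn a similar action distribution, or did it really learn to do front and back flips (?)
It suddenly strikes me that few-shot RL, meta RL (fast generalization to new tasks), and multi-task RL are interesting.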

posted @ 2024-02-27 21:08 MoonOut
