MORL | A quick run through the MORL work at the three major conferences




—— · ——



🎯 Paper lists

NeurIPS

https://proceedings.neurips.cc/papers/search?q=Multi+Objective+Reinforcement+Learning

ICLR

https://iclr.cc/virtual/2023/papers.html?filter=titles&search=Multi-Objective+Reinforcement+Learning

ICML


—— · ——



🎯 2019

Dynamic Weights in Multi-Objective Deep Reinforcement Learning (ICML)

Many real-world decision problems are characterized by multiple conflicting objectives which must be balanced based on their relative importance. In the dynamic weights setting the relative importance changes over time and specialized algorithms that deal with such change, such as a tabular Reinforcement Learning (RL) algorithm by Natarajan and Tadepalli (2005), are required. However, this earlier work is not feasible for RL settings that necessitate the use of function approximators. We generalize across weight changes and high-dimensional inputs by proposing a multi-objective Q-network whose outputs are conditioned on the relative importance of objectives and we introduce Diverse Experience Replay (DER) to counter the inherent non-stationarity of the Dynamic Weights setting. We perform an extensive experimental evaluation and compare our methods to adapted algorithms from Deep Multi-Task/Multi-Objective Reinforcement Learning and show that our proposed network in combination with DER dominates these adapted algorithms across weight change scenarios and problem domains.

  • background:
    • Many real-world decision problems involve multiple conflicting objectives that must be balanced according to their relative importance.
    • In the dynamic-weights setting, the relative importance changes over time, and specialized algorithms that handle such change are required, e.g. the tabular Reinforcement Learning (RL) algorithm of Natarajan and Tadepalli (2005).
    • However, this earlier work is not feasible for RL settings that require function approximators.
  • method:
    • They generalize across weight changes and high-dimensional inputs by proposing a multi-objective Q-network whose outputs are conditioned on the relative importance of the objectives.
    • They introduce Diverse Experience Replay (DER) to counter the inherent non-stationarity of the dynamic-weights setting.
    • DER apparently just means that when storing (s,a,s') or (s,a,w,r,s'), it tries to store transitions that differ from those already in the DER buffer.
    • The Q update here seems to be Q(s,a,w) ← r(s,a,w) + γ Q(s', argmax_{a'} w^T Q(s',a',w), w), without also taking the max over w' in the bootstrap term (see the sketch after this list).
  • experiment:
    • They perform an extensive experimental evaluation and compare their methods against adapted algorithms from deep multi-task / multi-objective RL.
    • Results show that the proposed network combined with DER dominates these adapted algorithms across weight-change scenarios and problem domains.
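A minimal sketch of how I read these two pieces (a w-conditioned multi-objective Q-network plus a diversity-filtered replay buffer). The distance-based filter, shapes, and hyperparameters are my own stand-ins, not the paper's exact DER:

```python
import random
import numpy as np
import torch
import torch.nn as nn

class ConditionedQNet(nn.Module):
    """Multi-objective Q-network conditioned on the weight vector: Q(s, ., w) -> (n_actions, n_obj)."""
    def __init__(self, n_state, n_actions, n_obj, hidden=64):
        super().__init__()
        self.n_actions, self.n_obj = n_actions, n_obj
        self.net = nn.Sequential(
            nn.Linear(n_state + n_obj, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions * n_obj))

    def forward(self, s, w):                       # s: (batch, n_state), w: (batch, n_obj)
        q = self.net(torch.cat([s, w], dim=-1))
        return q.view(-1, self.n_actions, self.n_obj)

class DiverseReplay:
    """Keeps a transition only if it is far (in feature distance) from stored ones (my stand-in criterion)."""
    def __init__(self, capacity=10000, min_dist=0.5):
        self.buf, self.capacity, self.min_dist = [], capacity, min_dist

    def add(self, transition):                     # transition = (s, a, r_vec, s'), numpy arrays
        feats = np.concatenate([transition[0], transition[3]])
        dists = [np.linalg.norm(feats - np.concatenate([t[0], t[3]])) for t in self.buf]
        if not dists or min(dists) > self.min_dist:
            self.buf.append(transition)
            if len(self.buf) > self.capacity:
                self.buf.pop(0)

    def sample(self, k):
        return random.sample(self.buf, min(k, len(self.buf)))

def td_target(q_net, r_vec, s_next, w, gamma=0.99):
    """Target using the current weights only: r + γ Q(s', argmax_a' w^T Q(s',a',w), w)."""
    with torch.no_grad():
        q_next = q_net(s_next, w)                  # (1, n_actions, n_obj)
        a_star = torch.argmax(q_next[0] @ w[0])    # scalarize with w, pick the best action
        return r_vec + gamma * q_next[0, a_star]
```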

Skimmed it briefly; it feels like a downgraded version of A Generalized Algorithm for MORL below...

A Generalized Algorithm for Multi-Objective Reinforcement Learning and Policy Adaptation (NeurIPS)

Feels like a classic paper, probably the first to provide theory (optimality, convergence, and so on); I'll write a dedicated blog post to summarize it when I have time.

There is also this Zhihu blog post, which I found quite well written.

🎯 2020

Prediction-Guided Multi-Objective Reinforcement Learning for Continuous Robot Control (ICML)

Many real-world control problems involve conflicting objectives where we desire a dense and high-quality set of control policies that are optimal for different objective preferences (called Pareto-optimal). While extensive research in multi-objective reinforcement learning (MORL) has been conducted to tackle such problems, multi-objective optimization for complex continuous robot control is still under-explored. In this work, we propose an efficient evolutionary learning algorithm to find the Pareto set approximation for continuous robot control problems, by extending a state-of-the-art RL algorithm and presenting a novel prediction model to guide the learning process. In addition to efficiently discovering the individual policies on the Pareto front, we construct a continuous set of Pareto-optimal solutions by Pareto analysis and interpolation. Furthermore, we design seven multi-objective RL environments with continuous action space, which is the first benchmark platform to evaluate MORL algorithms on various robot control problems. We test the previous methods on the proposed benchmark problems, and the experiments show that our approach is able to find a much denser and higher-quality set of Pareto policies than the existing algorithms.

  • Multi-objective optimization for complex continuous robot control.

  • method:

    • Proposes an efficient evolutionary learning algorithm by extending a state-of-the-art RL algorithm.
    • Proposes a novel prediction model to guide the learning process and find a Pareto-set approximation for continuous robot control problems.
    • Besides efficiently discovering individual policies on the Pareto front, they construct a continuous set of Pareto-optimal solutions via Pareto analysis and interpolation.
    • (Presumably no theory, no convergence proof or the like.)
  • experiment:

    • They design seven multi-objective RL environments with continuous action spaces, the first benchmark for evaluating MORL algorithms on various robot control problems. (Roughly the HalfCheetah / Hopper / Walker2d family.)
    • It finds a much denser and higher-quality set of Pareto policies than existing algorithms.

The paper introduces Multi-Objective Policy Gradient (see Section 3.2 for details; a small sketch follows the formulas):

  • \(J(\theta,w)=\boldsymbol w^T\boldsymbol R(\tau)\)
  • \(\nabla_\theta J(\theta,w)=\mathbb E\big[\boldsymbol w^T\boldsymbol Q^\pi(s,a)\,\nabla_\theta\log\pi_\theta(a|s)\big]\)
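A minimal sketch of that scalarized policy gradient for a toy softmax policy; the rollout data and the advantage estimates are placeholders of my own:

```python
import torch

n_obj, n_actions, obs_dim = 2, 3, 4
w = torch.tensor([0.7, 0.3])                       # preference weights
policy = torch.nn.Sequential(torch.nn.Linear(obs_dim, n_actions))

# Pretend these came from a rollout: states, actions taken, per-step multi-objective advantages.
states = torch.rand(16, obs_dim)
actions = torch.randint(n_actions, (16,))
vec_adv = torch.rand(16, n_obj)

logits = policy(states)
log_probs = torch.log_softmax(logits, dim=-1)[torch.arange(16), actions]
scalar_adv = vec_adv @ w                           # w^T A(s, a)
loss = -(scalar_adv.detach() * log_probs).mean()   # ascend E[w^T A * grad log pi]
loss.backward()                                    # gradients of J(theta, w) w.r.t. policy params
```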

Rough algorithm flow (seems similar to a genetic algorithm? AI helped me read this part; I haven't read it closely):

  1. Initialization stage

    • Randomly initialize n policies and optimize them with the multi-objective policy gradient (MOPG) algorithm, each with a different weight. The purpose is to escape the low-performance region and lay the groundwork for the subsequent evolutionary learning.
  2. Evolution stage

    • Prediction/improvement model: learn an analytical model for each policy that predicts, for a given weight, the expected performance improvement after optimizing that policy with MOPG. The model is a monotone hyperbolic model, built on the assumption that increasing a weight improves the corresponding objective.
    • Task selection: use the prediction model to guide a selection algorithm that picks n policy-weight pairs (called RL tasks here) expected to improve the quality of the Pareto set the most.
    • MOPG optimization: optimize the selected RL tasks in parallel with MOPG for a fixed number of iterations, producing new offspring policies.
    • Population update: maintain solution performance and diversity with a performance-buffer strategy, and update the policy population.
    • External Pareto archive: maintain an external Pareto archive that stores all non-dominated intermediate policies and outputs the approximate Pareto set at the end of the evolution stage.
  3. Pareto analysis stage

    • Continuous Pareto representation: run Pareto analysis on the computed Pareto-optimal policies to identify policy families, and extract a continuous representation of the Pareto set via intra-family interpolation.

Specific techniques:

  • Multi-objective policy gradient (MOPG): uses Proximal Policy Optimization (PPO) as the base algorithm, extending the value function and policy gradient to handle multiple objectives.

  • Prediction/improvement model: fits the hyperbolic model parameters of each policy with nonlinear least squares. The model is based on the intuitive observation that increasing weight w improves the corresponding objective. (A small fitting sketch follows this list.)

  • Task selection: a greedy algorithm picks tasks based on the improvements predicted by the model, so as to maximize Pareto quality.

  • Continuous Pareto representation: embed the high-dimensional policy parameter space into 2D with t-SNE for better visualization, then group the embedded policies into families with k-means clustering. Each family covers a contiguous segment of the Pareto front; linear interpolation within a family yields a continuous representation of the Pareto front.

  • The whole pipeline aims to discover policies on the Pareto front efficiently and to provide a denser, higher-quality policy set via the continuous Pareto representation.
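A minimal sketch of fitting a monotone "predicted improvement vs. weight" curve with nonlinear least squares. The hyperbolic form below is a generic stand-in I picked, not necessarily the paper's exact parameterization:

```python
import numpy as np
from scipy.optimize import curve_fit

def hyperbolic(w, a, b, c):
    # Monotone in w for a, b > 0: improvement saturates as the weight grows.
    return a * w / (b + w) + c

# Pretend these were measured: weight on one objective vs. observed performance improvement.
w_obs = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
delta_obs = np.array([0.4, 1.1, 1.6, 1.9, 2.1])

params, _ = curve_fit(hyperbolic, w_obs, delta_obs, p0=[2.0, 0.5, 0.0])
print(params, hyperbolic(0.6, *params))   # predicted improvement at an unseen weight
```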

🎯 2021

Using Logical Specifications of Objectives in Multi-Objective Reinforcement Learning (ICML)

It is notoriously difficult to control the behavior of reinforcement learning agents. Agents often learn to exploit the environment or reward signal and need to be retrained multiple times. The multi-objective reinforcement learning (MORL) framework separates a reward function into several objectives. An ideal MORL agent learns to generalize to novel combinations of objectives allowing for better control of an agent's behavior without requiring retraining. Many MORL approaches use a weight vector to parameterize the importance of each objective. However, this approach suffers from lack of expressiveness and interpretability. We propose using propositional logic to specify the importance of multiple objectives. By using a logic where predicates correspond directly to objectives, specifications are inherently more interpretable. Additionally the set of specifications that can be expressed with formal languages is a superset of what can be expressed by weight vectors. In this paper, we define a formal language based on propositional logic with quantitative semantics. We encode logical specifications using a recurrent neural network and show that MORL agents parameterized by these encodings are able to generalize to novel specifications over objectives and achieve performance comparable to single objective baselines.

  • background:
    • The multi-objective reinforcement learning (MORL) framework separates the reward function into several objectives. An ideal MORL agent learns to generalize to novel combinations of objectives, allowing better control of the agent's behavior without retraining.
    • Many MORL methods use a weight vector to parameterize the importance of each objective. However, this approach lacks expressiveness and interpretability.
  • intuition:
    • They propose using propositional logic to specify the importance of multiple objectives. With a logic whose predicates correspond directly to objectives, the specifications are inherently more interpretable.
    • Moreover, the set of specifications expressible with a formal language is a superset of what weight vectors can express.
  • method:
    • The paper defines a formal language based on propositional logic with quantitative semantics.
    • Logical specifications are encoded with a recurrent neural network.
    • Experiments: MORL agents parameterized by these encodings can generalize to novel specifications over objectives and reach performance comparable to single-objective baselines.
    • (The experimental environment seems to be a large grid world.)

Reading the poster (with AI's help):

Traditional MORL uses a weight vector to specify objective priorities, which is hard to interpret and hard to manipulate. Logical specifications provide a more intuitive and interpretable way to specify objectives, e.g. combining different objectives with logical and (∧) and or (∨).

The specifications look roughly like this:

\[\begin{aligned} o_2>1 \land o_3 \\ o_3 \geq .9 \lor o_3 \leq 0 \\ o_1\geq .8 \land o_3 \leq .6 \\ \end{aligned} \]

These translate into numerical rewards as follows (a small evaluator sketch comes right after):

\[\begin{aligned} f(\mathbf{r},o_n)& =\mathbf{r}[n] \\ f(\mathbf{r},\neg o_n)& =1-\mathbf{r}[n] \\ f(\mathbf{r},o_n\geq c)& =1 \text{ if } \mathbf{r}[n]\geq c \text{ else } 0 \\ f(\mathbf{r},o_n\leq c)& =1 \text{ if } \mathbf{r}[n]\leq c \text{ else } 0 \\ f(\mathbf{r},\psi_1\wedge\psi_2)& =\min\big(f(\mathbf{r},\psi_1),f(\mathbf{r},\psi_2)\big) \\ f(\mathbf{r},\psi_1\vee\psi_2)& =\max\big(f(\mathbf{r},\psi_1),f(\mathbf{r},\psi_2)\big) \end{aligned} \]
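A minimal sketch of this quantitative semantics, assuming specifications are represented as nested Python tuples (the tuple encoding and objective indices are my own, not the paper's):

```python
import numpy as np

def spec_value(r, spec):
    """Recursively evaluate a logical specification against a reward vector r."""
    op = spec[0]
    if op == "obj":                     # o_n          -> r[n]
        return r[spec[1]]
    if op == "not":                     # not psi      -> 1 - f(r, psi)
        return 1.0 - spec_value(r, spec[1])
    if op == "geq":                     # o_n >= c     -> indicator
        return float(r[spec[1]] >= spec[2])
    if op == "leq":                     # o_n <= c     -> indicator
        return float(r[spec[1]] <= spec[2])
    if op == "and":                     # psi1 ∧ psi2  -> min
        return min(spec_value(r, s) for s in spec[1:])
    if op == "or":                      # psi1 ∨ psi2  -> max
        return max(spec_value(r, s) for s in spec[1:])
    raise ValueError(f"unknown operator {op}")

r = np.array([0.9, 0.3, 0.7, 0.5])
print(spec_value(r, ("and", ("geq", 0, 0.8), ("leq", 3, 0.6))))  # -> 1.0
```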

Then an RNN (specifically a GRU, it seems) encodes the logical formula (from some one-hot token encoding) into an embedding ψ, which becomes part of the agent's input. The policy takes the form a = π(s, ψ), and the reward takes the form r(s, a, ψ).

Then training apparently just proceeds directly; the MORL algorithm seems to be DQN.

Provably Efficient Algorithms for Multi-Objective Competitive RL (ICML)

Seems to be an oral paper (??)

We study multi-objective reinforcement learning (RL) where an agent's reward is represented as a vector. In settings where an agent competes against opponents, its performance is measured by the distance of its average return vector to a target set. We develop statistically and computationally efficient algorithms to approach the associated target set. Our results extend Blackwell's approachability theorem~\citep{blackwell1956analog} to tabular RL, where strategic exploration becomes essential. The algorithms presented are adaptive; their guarantees hold even without Blackwell's approachability condition. If the opponents use fixed policies, we give an improved rate of approaching the target set while also tackling the more ambitious goal of simultaneously minimizing a scalar cost function. We discuss our analysis for this special case by relating our results to previous works on constrained RL. To our knowledge, this work provides the first provably efficient algorithms for vector-valued Markov games and our theoretical guarantees are near-optimal.

  • In settings where the agent competes against opponents, its performance is measured by the distance of its average return vector to a target set. They develop statistically and computationally efficient algorithms to approach the target set. The results extend Blackwell's approachability theorem to tabular RL, where strategic exploration becomes essential. The algorithms are adaptive; their guarantees hold even without Blackwell's approachability condition. If the opponents use fixed policies, they give an improved rate of approaching the target set while also tackling the more ambitious goal of simultaneously minimizing a scalar cost function. They discuss this special case by relating the results to previous work on constrained RL. To their knowledge, this is the first provably efficient algorithm for vector-valued Markov games, with near-optimal theoretical guarantees.

I couldn't understand it, and couldn't follow the talk either; way too mathematical for me.

Accommodating Picky Customers: Regret Bound and Exploration Complexity for Multi-Objective Reinforcement Learning (NeurIPS)

In this paper we consider multi-objective reinforcement learning where the objectives are balanced using preferences. In practice, the preferences are often given in an adversarial manner, e.g., customers can be picky in many applications. We formalize this problem as an episodic learning problem on a Markov decision process, where transitions are unknown and a reward function is the inner product of a preference vector with pre-specified multi-objective reward functions. We consider two settings. In the online setting, the agent receives a (adversarial) preference every episode and proposes policies to interact with the environment. We provide a model-based algorithm that achieves a nearly minimax optimal regret bound ˜O(√min{d,S}⋅H2SAK), where d is the number of objectives, S is the number of states, A is the number of actions, H is the length of the horizon, and K is the number of episodes. Furthermore, we consider preference-free exploration, i.e., the agent first interacts with the environment without specifying any preference and then is able to accommodate arbitrary preference vector up to ϵ error. Our proposed algorithm is provably efficient with a nearly optimal trajectory complexity ˜O(min{d,S}⋅H3SA/ϵ2). This result partly resolves an open problem raised by citation.

  • Multi-objective RL uses preferences to balance the objectives. In practice, the preferences are often adversarial, e.g. customers can be picky.
  • They formalize this as an episodic learning problem on a Markov decision process, where transitions are unknown and the reward function is the inner product of a preference vector with pre-specified multi-objective reward functions.
  • Two settings are considered.
    • In the online setting, the agent receives an (adversarial) preference every episode and proposes policies to interact with the environment.
    • They provide a model-based algorithm that achieves a nearly minimax-optimal regret bound \(\tilde O(\sqrt{\min\{d,S\}\cdot H^2SAK})\), where d is the number of objectives, S the number of states, A the number of actions, H the horizon length, and K the number of episodes.
    • They also consider preference-free exploration: the agent first interacts with the environment without specifying any preference, and can then accommodate an arbitrary preference vector up to ε error.
    • The proposed algorithm is provably efficient, with a nearly optimal trajectory complexity \(\tilde O(\min\{d,S\}\cdot H^3SA/\epsilon^2)\).

(Pure theory and model-based work; skipping it for now...

Offline Constrained Multi-Objective Reinforcement Learning via Pessimistic Dual Value Iteration (NeurIPS)

In constrained multi-objective RL, the goal is to learn a policy that achieves the best performance specified by a multi-objective preference function under a constraint. We focus on the offline setting where the RL agent aims to learn the optimal policy from a given dataset. This scenario is common in real-world applications where interactions with the environment are expensive and the constraint violation is dangerous. For such a setting, we transform the original constrained problem into a primal-dual formulation, which is solved via dual gradient ascent. Moreover, we propose to combine such an approach with pessimism to overcome the uncertainty in offline data, which leads to our Pessimistic Dual Iteration (PEDI). We establish upper bounds on both the suboptimality and constraint violation for the policy learned by PEDI based on an arbitrary dataset, which proves that PEDI is provably sample efficient. We also specialize PEDI to the setting with linear function approximation. To the best of our knowledge, we propose the first provably efficient constrained multi-objective RL algorithm with offline data without any assumption on the coverage of the dataset.

  • background:
    • In constrained multi-objective RL, the goal is to learn a policy that, under a constraint, achieves the best performance as specified by a multi-objective preference function.
    • They focus on the offline setting: interacting with the environment is expensive, and violating constraints is dangerous.
  • method:
    • Transform the original constrained problem into a primal-dual formulation, solved via dual gradient ascent. (A toy sketch of the generic primal-dual idea follows this list.)
    • They propose combining this approach with pessimism to overcome the uncertainty in offline data, which leads to Pessimistic Dual Iteration (PEDI).
  • theory:
    • They establish upper bounds on both the suboptimality and the constraint violation of the policy learned by PEDI from an arbitrary dataset, proving that PEDI is provably sample-efficient.
    • They also specialize PEDI to the setting with linear function approximation. (?)
  • contribution: to their knowledge, this is the first provably efficient constrained multi-objective RL algorithm with offline data that makes no assumption on dataset coverage.
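A minimal toy sketch of the generic primal-dual idea (a constrained objective solved by alternating primal updates with multiplier updates); the functions and thresholds are placeholders of my own, and this is not the paper's PEDI with pessimism:

```python
import numpy as np

# Toy problem: maximize f(x) subject to g(x) >= b, via the Lagrangian
# L(x, lam) = f(x) + lam * (g(x) - b).
f = lambda x: -(x - 2.0) ** 2          # "reward" objective
g = lambda x: x                        # "constraint" objective
b = 3.0                                # constraint threshold: g(x) >= b

x, lam = 0.0, 0.0
eta_x, eta_lam = 0.05, 0.05
for _ in range(2000):
    # primal step: gradient ascent on L in x (finite-difference gradient)
    grad_x = (f(x + 1e-4) - f(x - 1e-4)) / 2e-4 + lam * (g(x + 1e-4) - g(x - 1e-4)) / 2e-4
    x += eta_x * grad_x
    # dual step: move lam against the constraint slack, projected to lam >= 0
    lam = max(0.0, lam - eta_lam * (g(x) - b))
print(x, lam)   # roughly x ≈ 3 (constraint active), lam ≈ 2
```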

(Feels very theory-heavy, probably some offline-RL theory; I didn't find experiments, though I didn't look hard.)

🎯 2022

Anchor-Changing Regularized Natural Policy Gradient for Multi-Objective Reinforcement Learning (NeurIPS)

We study policy optimization for Markov decision processes (MDPs) with multiple reward value functions, which are to be jointly optimized according to given criteria such as proportional fairness (smooth concave scalarization), hard constraints (constrained MDP), and max-min trade-off. We propose an Anchor-changing Regularized Natural Policy Gradient (ARNPG) framework, which can systematically incorporate ideas from well-performing first-order methods into the design of policy optimization algorithms for multi-objective MDP problems. Theoretically, the designed algorithms based on the ARNPG framework achieve ~O(1/T) global convergence with exact gradients. Empirically, the ARNPG-guided algorithms also demonstrate superior performance compared to some existing policy gradient-based approaches in both exact gradients and sample-based scenarios.

  • Studies policy optimization for MDPs with multiple reward value functions, jointly optimized according to given criteria such as proportional fairness (smooth concave scalarization), hard constraints (constrained MDP), and max-min trade-offs.
  • Theoretically, algorithms designed under the ARNPG framework achieve \(\tilde O(1/T)\) global convergence with exact gradients.
  • Empirically, the ARNPG-guided algorithms also outperform some existing policy-gradient-based approaches in both exact-gradient and sample-based scenarios.
  • (I didn't fully understand its experimental setting.)

Reading the paper with AI's help:

  • The framework aims to systematically incorporate well-performing first-order optimization methods into the design of policy-optimization algorithms for multi-objective MDPs, by introducing an anchor and regularization into the policy-gradient update.
  • Theory: under softmax policy parameterization, algorithms under the ARNPG framework have guaranteed global convergence at rate O(1/T). That is, as the number of iterations T grows, the algorithm's performance (e.g. the policy's value function) converges to the optimum at rate O(1/T). (I didn't fully get this.)
  • I haven't gone through the concrete algorithm; according to AI,
  • for each inner iteration step t (starting from 0):
    • compute the policy gradient \(G^{(t)}\);
    • update the policy parameters using the Moore-Penrose pseudo-inverse of the Fisher information matrix: \(\theta^{(t+1)} = \theta^{(t)} + \alpha\, F_\rho(\theta^{(t)})^\dagger G^{(t)}\). (A toy NPG step is sketched below this list.)
  • I'm not sure how the anchor policy is updated.
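A minimal sketch of the natural policy gradient step above for a tabular softmax policy (plain NPG only, without the anchor/regularization terms the paper adds); the visitation distribution and advantages are made-up placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 3
theta = rng.normal(size=(n_states, n_actions))   # softmax policy parameters

def policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

# Pretend these came from rollouts: state visitation d(s) and advantages A(s, a).
d = np.full(n_states, 1.0 / n_states)
A = rng.normal(size=(n_states, n_actions))

pi = policy(theta)
G = np.zeros(n_states * n_actions)                       # vanilla policy gradient
F = np.zeros((n_states * n_actions, n_states * n_actions))  # Fisher information
for s in range(n_states):
    for a in range(n_actions):
        glog = np.zeros((n_states, n_actions))           # grad log pi(a|s) = e_a - pi(s, .)
        glog[s] = -pi[s]
        glog[s, a] += 1.0
        glog = glog.ravel()
        G += d[s] * pi[s, a] * A[s, a] * glog
        F += d[s] * pi[s, a] * np.outer(glog, glog)

alpha = 0.1
theta = theta + alpha * (np.linalg.pinv(F) @ G).reshape(n_states, n_actions)  # NPG step
```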

🎯 2023

A Toolkit for Reliable Benchmarking and Research in Multi-Objective Reinforcement Learning (NeurIPS)

Multi-objective reinforcement learning algorithms (MORL) extend standard reinforcement learning (RL) to scenarios where agents must optimize multiple---potentially conflicting---objectives, each represented by a distinct reward function. To facilitate and accelerate research and benchmarking in multi-objective RL problems, we introduce a comprehensive collection of software libraries that includes: (i) MO-Gymnasium, an easy-to-use and flexible API enabling the rapid construction of novel MORL environments. It also includes more than 20 environments under this API. This allows researchers to effortlessly evaluate any algorithms on any existing domains; (ii) MORL-Baselines, a collection of reliable and efficient implementations of state-of-the-art MORL algorithms, designed to provide a solid foundation for advancing research. Notably, all algorithms are inherently compatible with MO-Gymnasium; and(iii) a thorough and robust set of benchmark results and comparisons of MORL-Baselines algorithms, tested across various challenging MO-Gymnasium environments. These benchmarks were constructed to serve as guidelines for the research community, underscoring the properties, advantages, and limitations of each particular state-of-the-art method.

  • (i) MO-Gymnasium: rapid construction of new MORL environments, including more than 20 environments.
  • (ii) MORL-Baselines: a collection of reliable and efficient implementations of state-of-the-art MORL algorithms, meant as a solid foundation for further research. Notably, all algorithms are compatible with MO-Gymnasium.
  • (iii) A thorough and robust set of benchmark results and comparisons of the MORL-Baselines algorithms, tested across various challenging MO-Gymnasium environments.

(This one probably doesn't count) Learning Dynamic Attribute-factored World Models for Efficient Multi-object Reinforcement Learning (NeurIPS)

(Oops, my search matched multi-object RL rather than multi-objective RL.)

In many reinforcement learning tasks, the agent has to learn to interact with many objects of different types and generalize to unseen combinations and numbers of objects. Often a task is a composition of previously learned tasks (e.g. block stacking).These are examples of compositional generalization, in which we compose object-centric representations to solve complex tasks. Recent works have shown the benefits of object-factored representations and hierarchical abstractions for improving sample efficiency in these settings. On the other hand, these methods do not fully exploit the benefits of factorization in terms of object attributes. In this paper, we address this opportunity and introduce the Dynamic Attribute FacTored RL (DAFT-RL) framework. In DAFT-RL, we leverage object-centric representation learning to extract objects from visual inputs. We learn to classify them into classes and infer their latent parameters. For each class of object, we learn a class template graph that describes how the dynamics and reward of an object of this class factorize according to its attributes. We also learn an interaction pattern graph that describes how objects of different classes interact with each other at the attribute level. Through these graphs and a dynamic interaction graph that models the interactions between objects, we can learn a policy that can then be directly applied in a new environment by estimating the interactions and latent parameters.We evaluate DAFT-RL in three benchmark datasets and show our framework outperforms the state-of-the-art in generalizing across unseen objects with varying attributes and latent parameters, as well as in the composition of previously learned tasks.

  • background:
    • In many RL tasks, the agent must learn to interact with many objects of different types and generalize to unseen combinations and numbers of objects. Often a task is a composition of previously learned tasks (e.g. block stacking). These are examples of compositional generalization, in which object-centric representations are composed to solve complex tasks.
    • Recent work has shown the benefits of object-factored representations and hierarchical abstractions for improving sample efficiency in these settings. On the other hand, these methods do not fully exploit the benefits of factorization in terms of object attributes.
  • method:
    • The paper introduces the Dynamic Attribute FacTored RL (DAFT-RL) framework.
    • In DAFT-RL, object-centric representation learning is used to extract objects from visual inputs; the objects are classified into classes and their latent parameters inferred.
    • For each object class, a class template graph is learned, describing how the dynamics and reward of an object of this class factorize according to its attributes.
    • An interaction pattern graph is also learned, describing how objects of different classes interact with each other at the attribute level.
    • Through these graphs and a dynamic interaction graph that models object-object interactions, a policy can be learned that transfers directly to a new environment by estimating the interactions and latent parameters.
  • experiments: DAFT-RL is evaluated on three benchmark datasets and outperforms the state of the art in generalizing across unseen objects with varying attributes and latent parameters, as well as in composing previously learned tasks.

Distributional Pareto-Optimal Multi-Objective Reinforcement Learning (NeurIPS)

Multi-objective reinforcement learning (MORL) has been proposed to learn control policies over multiple competing objectives with each possible preference over returns. However, current MORL algorithms fail to account for distributional preferences over the multi-variate returns, which are particularly important in real-world scenarios such as autonomous driving. To address this issue, we extend the concept of Pareto-optimality in MORL into distributional Pareto-optimality, which captures the optimality of return distributions, rather than the expectations. Our proposed method, called Distributional Pareto-Optimal Multi-Objective Reinforcement Learning~(DPMORL), is capable of learning distributional Pareto-optimal policies that balance multiple objectives while considering the return uncertainty. We evaluated our method on several benchmark problems and demonstrated its effectiveness in discovering distributional Pareto-optimal policies and satisfying diverse distributional preferences compared to existing MORL methods.

  • background:
    • Multi-objective reinforcement learning (MORL) learns control policies over multiple competing objectives for every possible preference over returns.
    • gap: however, current MORL algorithms fail to account for distributional preferences over the multivariate returns, which are particularly important in real-world scenarios such as autonomous driving.
  • method:
    • To address this, they extend the notion of Pareto optimality in MORL to distributional Pareto optimality, which captures optimality of the return distributions rather than of their expectations.
    • The proposed method, Distributional Pareto-Optimal Multi-Objective Reinforcement Learning (DPMORL), learns policies that balance multiple objectives while accounting for return uncertainty, i.e. the Distributional Pareto-Optimal (DPO) policies defined earlier in the paper.
  • experiments:
    • They evaluate the method on several benchmark problems and demonstrate its effectiveness, compared to existing MORL methods, in discovering distributional Pareto-optimal policies and satisfying diverse distributional preferences.
    • (The environments seem to be DeepSeaTreasure, FruitTree, HalfCheetah and the like.)

Key techniques, as summarized by AI after reading the paper:

  1. Generating nonlinear utility functions: parameterize utility functions with non-decreasing neural networks, and generate a diverse set of utility functions via an objective on the minimum distance between their values and gradients. (A small monotone-network sketch follows this list.)
  2. Policy optimization: for each generated utility function, optimize a policy with Algorithm 1 (Utility-based Reinforcement Learning), which turns the multi-objective problem into a single-objective one by augmenting the state space and transforming the reward function, then applies an existing RL algorithm (e.g. PPO or SAC).
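A minimal sketch of one common way to build a non-decreasing utility network (non-negative weights via softplus plus monotone activations). This construction is my own assumption, not necessarily the paper's exact parameterization:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotoneUtility(nn.Module):
    """Maps a return vector to a scalar utility, non-decreasing in every objective."""
    def __init__(self, n_obj, hidden=64):
        super().__init__()
        self.w1 = nn.Parameter(torch.randn(n_obj, hidden) * 0.1)
        self.b1 = nn.Parameter(torch.zeros(hidden))
        self.w2 = nn.Parameter(torch.randn(hidden, 1) * 0.1)
        self.b2 = nn.Parameter(torch.zeros(1))

    def forward(self, returns):
        # softplus keeps the effective weights non-negative, and tanh is monotone,
        # so the whole map is non-decreasing in each objective's return.
        h = torch.tanh(returns @ F.softplus(self.w1) + self.b1)
        return h @ F.softplus(self.w2) + self.b2

u = MonotoneUtility(n_obj=2)
print(u(torch.tensor([[1.0, 0.5]])))
```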

Multi-Objective Reinforcement Learning: Convexity, Stationarity and Pareto Optimality (ICLR)

In recent years, single-objective reinforcement learning (SORL) algorithms have received a significant amount of attention and seen some strong results. However, it is generally recognized that many practical problems have intrinsic multi-objective properties that cannot be easily handled by SORL algorithms. Although there have been many multi-objective reinforcement learning (MORL) algorithms proposed, there has been little recent exploration of the fundamental properties of the spaces we are learning in. In this paper, we perform a rigorous analysis of policy induced value functions and use the insights to distinguish three views of Pareto optimality. The results imply the convexity of the induced value function's range for stationary policies and suggest that any point of its Pareto front can be achieved by training a policy using linear scalarization (LS). We show the problem that leads to the suboptimal performance of LS can be solved by adding strongly concave terms to the immediate rewards, which motivates us to propose a new vector reward-based Q-learning algorithm, CAPQL. Combined with an actor-critic formulation, our algorithm achieves state-of-the-art performance on multiple MuJoCo tasks in the preference agnostic setting. Furthermore, we empirically show that, in contrast to other LS-based algorithms, our approach is significantly more stable, achieving similar results across various random seeds.

  • background:
    • Although many MORL algorithms have been proposed, there has been little exploration of the fundamental properties of the spaces we are learning in.
  • method:
    • The paper performs a rigorous analysis of policy-induced value functions and uses the insights to distinguish three views of Pareto optimality.
    • The results imply that the range of the induced value function over stationary policies is convex, and that any point of its Pareto front can be achieved by training a policy with linear scalarization (LS). (Quite mathematical, but it sounds both rigorous and clear.) (Review scores were 5 6 6 8.)
    • They show that the issue causing suboptimal LS performance can be solved by adding strongly concave terms to the immediate rewards, which motivates a new vector-reward-based Q-learning algorithm, CAPQL.
  • experiment:
    • Combined with an actor-critic formulation, the algorithm achieves state-of-the-art performance on multiple MuJoCo tasks in the preference-agnostic setting.
    • Moreover, empirically, compared with other LS-based algorithms, the approach is significantly more stable, achieving similar results across random seeds.

The video (the parts I could follow) roughly says two things:

  • With a slight tweak to reward(s,a), linear preferences (i.e. \(\boldsymbol w^T\boldsymbol r(s,a)\)) are enough to reach every point on the Pareto front. (So just run single-objective RL with a linear preference?)
  • The tweak: add to each reward(s,a) an intrinsic reward (if one can call it that) that makes the objective strongly concave(?), e.g. a bonus for the policy's entropy(?). (A toy sketch follows.)
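A minimal sketch of that idea in SAC style: scalarize with w, then add an entropy-style bonus as the concave term. The coefficient and the function name are my own, not the paper's exact CAPQL objective:

```python
import torch

def scalarized_reward(r_vec, w, log_prob, alpha=0.2):
    """Linear scalarization w^T r(s,a) plus an entropy-style bonus -alpha * log pi(a|s).

    r_vec: (batch, n_obj) vector rewards; w: (n_obj,) preference; log_prob: (batch,).
    The entropy bonus plays the role of the strongly concave add-on discussed above (my reading).
    """
    return r_vec @ w - alpha * log_prob

r_vec = torch.tensor([[1.0, 0.0], [0.2, 0.8]])
w = torch.tensor([0.5, 0.5])
log_prob = torch.tensor([-1.2, -0.3])
print(scalarized_reward(r_vec, w, log_prob))
```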

PD-MORL: Preference-Driven Multi-Objective Reinforcement Learning Algorithm (ICLR)

Multi-objective reinforcement learning (MORL) approaches have emerged to tackle many real-world problems with multiple conflicting objectives by maximizing a joint objective function weighted by a preference vector. These approaches find fixed customized policies corresponding to preference vectors specified during training. However, the design constraints and objectives typically change dynamically in real-life scenarios. Furthermore, storing a policy for each potential preference is not scalable. Hence, obtaining a set of Pareto front solutions for the entire preference space in a given domain with a single training is critical. To this end, we propose a novel MORL algorithm that trains a single universal network to cover the entire preference space scalable to continuous robotic tasks. The proposed approach, Preference-Driven MORL (PD-MORL), utilizes the preferences as guidance to update the network parameters. It also employs a novel parallelization approach to increase sample efficiency. We show that PD-MORL achieves up to 25% larger hypervolume for challenging continuous control tasks and uses an order of magnitude fewer trainable parameters compared to prior approaches.

  • background:
    • Multi-objective reinforcement learning (MORL) methods tackle many real-world problems with multiple conflicting objectives by maximizing a joint objective function weighted by a preference vector. These approaches find fixed, customized policies corresponding to preference vectors specified during training.
    • However, in real-life scenarios, design constraints and objectives typically change dynamically. Moreover, storing a policy for every potential preference does not scale.
    • Hence it is critical to obtain, with a single training run, a set of Pareto-front solutions for the entire preference space of a given domain.
  • method:
    • To this end, they propose a new MORL algorithm that trains a single universal network covering the entire preference space, scalable to continuous robotic tasks.
    • The proposed approach, Preference-Driven MORL (PD-MORL), uses the preferences as guidance to update the network parameters. (Convergence seems to be proven.)
    • It also employs a novel parallelization approach to improve sample efficiency.
  • experiment:
    • PD-MORL achieves up to 25% larger hypervolume on challenging continuous control tasks and uses an order of magnitude fewer trainable parameters than prior approaches.
    • Environments: Deep Sea Treasure, Fruit Tree Navigation, plus Walker2d / HalfCheetah and the like.

Main contribution:

  • Its fixed-policy Bellman operator is the same as Envelope MOQ's, but the optimality Bellman operator is modified into something based on cosine similarity (a small target-computation sketch follows):

  • \[(\mathcal{B}\mathbf{Q})(s,a,\boldsymbol{\omega}):=\mathbf{r}(s,a)+\gamma\,\mathbb{E}_{s^{\prime}\sim\mathcal{P}(\cdot|s,a)}\, \mathbf{Q}\Big(s^{\prime},\ \arg\sup_{a^{\prime}\in\mathcal{A}}\big( S_c[\boldsymbol{\omega},\mathbf{Q}(s^{\prime},a^{\prime},\boldsymbol{\omega})] \cdot[\boldsymbol{\omega}^T \mathbf{Q}(s^{\prime},a^{\prime},\boldsymbol{\omega})]\big),\ \boldsymbol{\omega}\Big) \]

  • where \(S_c[\boldsymbol{\omega},\mathbf{Q}(s^{\prime},a^{\prime},\boldsymbol{\omega})]\) is the cosine similarity between ω and Q. This Bellman operator, like \(B_\pi\), is a contraction under the same pseudo-metric as Envelope MOQ, \(d(Q,Q')=\sup_{s,a,w} |w^T(Q(s,a,w)-Q'(s,a,w))|\).
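A minimal sketch of computing that target for a discrete action space with a toy Q-table; shapes and names are my own:

```python
import torch
import torch.nn.functional as F

def pd_morl_target(r, q_next, w, gamma=0.99):
    """One-sample PD-MORL-style target.

    r:      (n_obj,) vector reward
    q_next: (n_actions, n_obj) multi-objective Q(s', a', w) for each action a'
    w:      (n_obj,) preference vector
    """
    cos = F.cosine_similarity(q_next, w.expand_as(q_next), dim=-1)  # S_c[w, Q]
    scal = q_next @ w                                               # w^T Q
    a_star = torch.argmax(cos * scal)                               # action maximizing the product
    return r + gamma * q_next[a_star]

r = torch.tensor([1.0, 0.0])
q_next = torch.tensor([[0.5, 0.5], [1.0, 0.1], [0.2, 0.9]])
w = torch.tensor([0.7, 0.3])
print(pd_morl_target(r, q_next, w))
```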

Q-Pensieve: Boosting Sample Efficiency of Multi-Objective RL Through Memory Sharing of Q-Snapshots (ICLR)

Many real-world continuous control problems are in the dilemma of weighing the pros and cons, multi-objective reinforcement learning (MORL) serves as a generic framework of learning control policies for different preferences over objectives. However, the existing MORL methods either rely on multiple passes of explicit search for finding the Pareto front and therefore are not sample-efficient, or utilizes a shared policy network for coarse knowledge sharing among policies. To boost the sample efficiency of MORL, we propose Q-Pensieve, a policy improvement scheme that stores a collection of Q-snapshots to jointly determine the policy update direction and thereby enables data sharing at the policy level. We show that Q-Pensieve can be naturally integrated with soft policy iteration with convergence guarantee. To substantiate this concept, we propose the technique of Qreplay buffer, which stores the learned Q-networks from the past iterations, and arrive at a practical actor-critic implementation. Through extensive experiments and an ablation study, we demonstrate that with much fewer samples, the proposed algorithm can outperform the benchmark MORL methods on a variety of MORL benchmark tasks.

  • gap:
    • Existing MORL methods either rely on multiple passes of explicit search to find the Pareto front, and are therefore not sample-efficient,
    • or use a shared policy network for only coarse knowledge sharing among policies.
  • method:
    • To boost the sample efficiency of MORL, they propose Q-Pensieve, a policy improvement scheme that stores a collection of Q-snapshots to jointly determine the policy update direction, thereby enabling data sharing at the policy level. (This sounds fun.)
    • They show that Q-Pensieve can be naturally integrated with soft policy iteration with a convergence guarantee.
    • To substantiate the concept, they propose the Q-replay buffer, which stores Q-networks learned in past iterations, and arrive at a practical actor-critic implementation. (Review scores were also 5 6 6 8.)
  • experiment:
    • Extensive experiments and an ablation study show that, with far fewer samples, the proposed algorithm outperforms the benchmark MORL methods on a variety of MORL benchmark tasks.
    • Environments: DST2d (presumably Deep Sea Treasure), LunarLander4d (what does the 4d mean?), HalfCheetah2d, Hopper3d, and so on.

Roughly, it builds on Envelope MOQ: in the \(\sup_{w'} [w^T Q(s,a,w')]\) step of the H operator, it also takes the sup over all the Q-networks stored in the Q-replay buffer, i.e. it pulls out every previously saved Q function and checks whether one of them does better at some weight w'. (A small sketch follows.)
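A minimal sketch of that extra sup for discrete actions, with Q-snapshots stored as a list of toy tensors; this is my own simplification of the idea, not the paper's actor-critic implementation:

```python
import torch

def q_pensieve_backup(r, w, q_snapshots, gamma=0.99):
    """Envelope-style backup that also maximizes over stored Q-snapshots.

    r:           (n_obj,) vector reward
    w:           (n_obj,) current preference
    q_snapshots: list of (n_w, n_actions, n_obj) tensors; Q_k(s', a', w') evaluated
                 on a set of sampled candidate preferences w' (first dimension)
    """
    best, best_q = -float("inf"), None
    for q in q_snapshots:                       # sup over past Q-networks in the Q-replay buffer
        scal = torch.einsum("wan,n->wa", q, w)  # w^T Q_k(s', a', w') for every (w', a')
        idx = int(torch.argmax(scal))
        if float(scal.flatten()[idx]) > best:
            best = float(scal.flatten()[idx])
            wi, ai = divmod(idx, q.shape[1])
            best_q = q[wi, ai]
    return r + gamma * best_q                   # r + γ * Q at the best (snapshot, w', a')

r = torch.tensor([0.5, 0.5])
w = torch.tensor([0.6, 0.4])
q_snapshots = [torch.randn(2, 3, 2) for _ in range(2)]
print(q_pensieve_backup(r, w, q_snapshots))
```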

Scaling Pareto-Efficient Decision Making via Offline Multi-Objective RL (ICLR)

The goal of multi-objective reinforcement learning (MORL) is to learn policies that simultaneously optimize multiple competing objectives. In practice, an agent's preferences over the objectives may not be known apriori, and hence, we require policies that can generalize to arbitrary preferences at test time. In this work, we propose a new data-driven setup for offline MORL, where we wish to learn a preference-agnostic policy agent using only a finite dataset of offline demonstrations of other agents and their preferences. The key contributions of this work are two-fold. First, we introduce D4MORL, (D)atasets for MORL that are specifically designed for offline settings. It contains 1.8 million annotated demonstrations obtained by rolling out reference policies that optimize for randomly sampled preferences on 6 MuJoCo environments with 2-3 objectives each. Second, we propose Pareto-Efficient Decision Agents (PEDA), a family of offline MORL algorithms that builds and extends Decision Transformers via a novel preference-and-return-conditioned policy. Empirically, we show that PEDA closely approximates the behavioral policy on the D4MORL benchmark and provides an excellent approximation of the Pareto-front with appropriate conditioning, as measured by the hypervolume and sparsity metrics.

  • background:
    • The goal of multi-objective reinforcement learning (MORL) is to learn policies that simultaneously optimize multiple competing objectives.
    • In practice, the agent's preferences over the objectives may not be known a priori, so we need policies that generalize to arbitrary preferences at test time.
  • method:
    • This work proposes a new data-driven setup for offline MORL: learning a preference-agnostic policy agent using only a finite dataset of offline demonstrations of other agents and their preferences.
    • First, they introduce D4MORL, datasets specifically designed for offline MORL settings, containing 1.8 million annotated demonstrations obtained by rolling out reference policies that optimize randomly sampled preferences on 6 MuJoCo environments, each with 2-3 objectives.
    • Second, they propose Pareto-Efficient Decision Agents (PEDA), a family of offline MORL algorithms that builds on and extends Decision Transformers via a novel preference-and-return-conditioned policy.
  • experiment: PEDA closely approximates the behavioral policy on the D4MORL benchmark and, with appropriate conditioning, provides an excellent approximation of the Pareto front, as measured by the hypervolume and sparsity metrics.

The main contributions seem to be 1. D4MORL, and 2. a simple Decision Transformer modification(?). (A small conditioning sketch is below.)
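A minimal sketch of what "preference-and-return-conditioned" might look like at the input level: embedding the preference vector together with the (vector) return-to-go as a conditioning token of a Decision-Transformer-style model. The shapes and the concatenation scheme are my assumptions, not necessarily PEDA's exact design:

```python
import torch
import torch.nn as nn

class PrefReturnConditioning(nn.Module):
    """Embeds (preference, vector return-to-go) into one conditioning token."""
    def __init__(self, n_obj, d_model=128):
        super().__init__()
        self.embed = nn.Linear(2 * n_obj, d_model)

    def forward(self, preference, return_to_go):
        # preference, return_to_go: (batch, seq_len, n_obj)
        cond = torch.cat([preference, return_to_go], dim=-1)
        return self.embed(cond)   # fed to the transformer alongside state/action tokens

cond_tok = PrefReturnConditioning(n_obj=2)(torch.rand(1, 10, 2), torch.rand(1, 10, 2))
print(cond_tok.shape)   # torch.Size([1, 10, 128])
```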



posted @ 2024-05-28 22:31  MoonOut