The goal of multi-objective reinforcement learning (MORL) is to learn policies that simultaneously optimize multiple competing objectives. In practice, an agent's preferences over the objectives may not be known a priori, and hence, we require policies that can generalize to arbitrary preferences at test time. In this work, we propose a new data-driven setup for offline MORL, where we wish to learn a preference-agnostic agent using only a finite dataset of offline demonstrations of other agents and their preferences. The key contributions of this work are twofold. First, we introduce D4MORL, (D)atasets for MORL that are specifically designed for offline settings. It contains 1.8 million annotated demonstrations obtained by rolling out reference policies that optimize for randomly sampled preferences on 6 MuJoCo environments with 2-3 objectives each. Second, we propose Pareto-Efficient Decision Agents (PEDA), a family of offline MORL algorithms that builds on and extends Decision Transformers via a novel preference-and-return-conditioned policy. Empirically, we show that PEDA closely approximates the behavioral policy on the D4MORL benchmark and, with appropriate conditioning, provides an excellent approximation of the Pareto front, as measured by the hypervolume and sparsity metrics.
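To make the evaluation criteria concrete, below is a minimal sketch of the two metrics mentioned above, hypervolume and sparsity, for a set of multi-objective returns under maximization. This is an illustrative implementation following standard MORL definitions, not the benchmark's own code; the function names, the 2-objective hypervolume routine, and the reference-point convention are assumptions for exposition.

```python
# Minimal sketch (not the D4MORL/PEDA implementation): Pareto-front filtering,
# 2-objective hypervolume, and sparsity, assuming all objectives are maximized.
import numpy as np


def pareto_front(returns: np.ndarray) -> np.ndarray:
    """Keep only the non-dominated points among the achieved returns."""
    keep = []
    for i, p in enumerate(returns):
        # p is dominated if some other point is >= p everywhere and > p somewhere.
        dominated = np.any(
            np.all(returns >= p, axis=1) & np.any(returns > p, axis=1)
        )
        if not dominated:
            keep.append(i)
    return returns[keep]


def hypervolume_2d(front: np.ndarray, ref: np.ndarray) -> float:
    """Area dominated by the front relative to a reference point (2 objectives)."""
    pts = front[np.argsort(-front[:, 0])]  # sort by first objective, descending
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:
        hv += (x - ref[0]) * (y - prev_y)  # add the new rectangular slab
        prev_y = y
    return hv


def sparsity(front: np.ndarray) -> float:
    """Average squared gap between consecutive front points along each objective."""
    if len(front) < 2:
        return 0.0
    s = 0.0
    for j in range(front.shape[1]):
        vals = np.sort(front[:, j])
        s += np.sum(np.diff(vals) ** 2)
    return s / (len(front) - 1)
```

A higher hypervolume indicates a front that dominates a larger region of objective space, while a lower sparsity indicates a denser, more evenly covered front; the two are reported together because either one alone can be gamed (e.g., a single extreme point can have low sparsity but poor hypervolume).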