Many real-world problems involve multiple, possibly conflicting, objectives. Multi-objective reinforcement learning (MORL) approaches have emerged to tackle these problems by maximizing a joint objective function weighted by a preference vector. These approaches find fixed, customized policies corresponding to preference vectors specified during training. However, design constraints and objectives typically change dynamically in real-life scenarios. Furthermore, storing a policy for each potential preference is not scalable. Hence, obtaining a set of Pareto front solutions for the entire preference space in a given domain with a single training run is critical. To this end, we propose a novel MORL algorithm that trains a single universal network to cover the entire preference space. The proposed approach, Preference-Driven MORL (PD-MORL), utilizes the preferences as guidance to update the network parameters. After demonstrating PD-MORL on the classical Deep Sea Treasure and Fruit Tree Navigation benchmarks, we evaluate its performance on challenging multi-objective continuous control tasks.
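To make the idea of a single universal network covering the entire preference space concrete, below is a minimal PyTorch-style sketch of a preference-conditioned value network. This is an illustrative assumption, not the exact architecture used by PD-MORL: the class name `PreferenceConditionedQNet`, the layer sizes, and the weighted-sum scalarization are all hypothetical choices for exposition.

```python
import torch
import torch.nn as nn


class PreferenceConditionedQNet(nn.Module):
    """Illustrative universal network: conditions on both the state and a
    preference vector, and outputs a vector-valued Q estimate (one entry
    per objective). A single set of weights thus serves every preference."""

    def __init__(self, state_dim: int, num_objectives: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + num_objectives, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_objectives),  # multi-objective Q vector
        )

    def forward(self, state: torch.Tensor, preference: torch.Tensor):
        # Concatenating the preference with the state lets one network cover
        # the whole preference space; the scalarized value is the
        # preference-weighted sum of the per-objective estimates.
        x = torch.cat([state, preference], dim=-1)
        q_vec = self.net(x)
        scalarized = (q_vec * preference).sum(dim=-1, keepdim=True)
        return q_vec, scalarized


# Usage sketch: sample a preference per transition so training signals span
# the full preference space rather than a single fixed weighting.
if __name__ == "__main__":
    state = torch.randn(4, 8)                       # batch of 4 states, dim 8
    pref = torch.rand(4, 2)
    pref = pref / pref.sum(dim=-1, keepdim=True)    # normalize to the simplex
    q_vec, scalar_q = PreferenceConditionedQNet(8, 2)(state, pref)
    print(q_vec.shape, scalar_q.shape)              # (4, 2), (4, 1)
```

In this sketch, the preference acts as an explicit input that guides both the forward pass and, through the scalarized value, the gradient updates; the actual PD-MORL training objective and update rule are described in the body of the paper.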