Multi-objective reinforcement learning (MORL) algorithms tackle sequential decision problems where agents may have different preferences over (possibly conflicting) reward functions. Such algorithms often learn a set of policies (each optimized for a particular agent preference) that can later be used to solve problems with novel preferences. We introduce a novel algorithm that uses Generalized Policy Improvement (GPI) to define principled, formally derived prioritization schemes that improve sample efficiency. These schemes implement active-learning strategies by which the agent can (i) identify the most promising preferences/objectives to train on at each moment, to more rapidly solve a given MORL problem; and (ii) identify which previous experiences are most relevant when learning a policy for a particular agent preference, via a novel Dyna-style MORL method. We prove that our algorithm is guaranteed to always converge to an optimal solution in a finite number of steps, or to an $\epsilon$-optimal solution (for a bounded $\epsilon$) if the agent is limited and can only identify possibly sub-optimal policies. We also prove that our method monotonically improves the quality of its partial solutions while learning. Finally, we introduce a bound that characterizes the maximum utility loss (with respect to the optimal solution) incurred by the partial solutions computed by our method throughout learning. We empirically show that our method outperforms state-of-the-art MORL algorithms in challenging multi-objective tasks with both discrete and continuous state and action spaces.
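To make the GPI-based prioritization idea concrete, the sketch below is a minimal illustration (not the paper's formal scheme): it assumes each learned policy is summarized by successor features `psi`, and that a hypothetical optimistic `upper_bound` on a preference's achievable value is available, so candidate preferences can be scored by the gap between that bound and the value GPI already guarantees with the current policy set.

```python
import numpy as np

def gpi_value(psi_set, w, state):
    """GPI lower bound on the value of preference w at a state:
    the best learned policy under w, i.e. max_pi  w . psi^pi(state).
    psi_set: list of mappings from state to a feature vector."""
    return max(float(np.dot(w, psi[state])) for psi in psi_set)

def select_next_preference(candidate_ws, psi_set, upper_bound, state):
    """Hypothetical active-learning rule (illustrative assumption):
    prioritize the candidate preference with the largest gap between
    an optimistic upper bound and the value already guaranteed by GPI,
    i.e. the preference where further training could help the most."""
    gaps = [upper_bound(w, state) - gpi_value(psi_set, w, state)
            for w in candidate_ws]
    return candidate_ws[int(np.argmax(gaps))]
```

The gap-based rule shown here is only one plausible way to turn a GPI value bound into a training priority; the paper derives its prioritization schemes formally, with the convergence and utility-loss guarantees stated in the abstract.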