In many real-world scenarios, a user's utility is derived from a single execution of a policy. In this case, to apply multi-objective reinforcement learning, the expected utility of the returns must be optimised. Various scenarios exist where a user's preferences over objectives (also known as the utility function) are unknown or difficult to specify. In such scenarios, a set of optimal policies must be learned. However, settings where the expected utility must be maximised have been largely overlooked by the multi-objective reinforcement learning community and, as a consequence, a suitable set of optimal solutions has yet to be defined. In this paper, we address this challenge by proposing first-order stochastic dominance as a criterion for building solution sets that maximise expected utility. We also propose a new dominance criterion, expected scalarised returns (ESR) dominance, which extends first-order stochastic dominance so that a set of optimal policies can be learned in practice. We then define a new solution concept, the ESR set, which is the set of policies that are ESR dominant. Finally, we define a new multi-objective tabular distributional reinforcement learning (MOT-DRL) algorithm to learn the ESR set in a multi-objective multi-armed bandit setting.
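For readers unfamiliar with the criterion, the standard scalar statement of first-order stochastic dominance, and the classical equivalence that links it to expected utility, can be written as follows. This is a textbook formulation added here for clarity; the multi-objective definitions in the paper extend it to vector-valued returns.
\[
Z^{\pi} \succeq_{\mathrm{FSD}} Z^{\pi'}
\;\iff\;
F_{Z^{\pi}}(v) \le F_{Z^{\pi'}}(v) \ \text{for all } v
\;\iff\;
\mathbb{E}\big[u(Z^{\pi})\big] \ge \mathbb{E}\big[u(Z^{\pi'})\big] \ \text{for every monotonically increasing utility function } u,
\]
where $Z^{\pi}$ denotes the return of a single execution of policy $\pi$ and $F_{Z^{\pi}}$ is its cumulative distribution function. The second equivalence is what makes a dominance criterion of this kind a natural basis for solution sets when the expected utility of a single execution must be maximised but the utility function itself is unknown.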
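To make the idea concrete, below is a minimal, illustrative Python sketch of checking this kind of dominance between empirical return distributions in a toy two-armed, two-objective bandit. The function names (`fsd_dominates`, `esr_set`), the componentwise empirical-CDF comparison, and the pruning step are assumptions made here for illustration; this is not the MOT-DRL algorithm and it may differ from the paper's exact ESR dominance definition.

```python
import numpy as np

def fsd_dominates(samples_a, samples_b):
    """True if the empirical return distribution of A dominates that of B:
    F_A(v) <= F_B(v) at every pooled sample point v, with strict inequality
    at least once. Vector-valued returns are compared via the componentwise
    empirical CDF, one common way to extend scalar FSD (an assumption here)."""
    a = np.atleast_2d(np.asarray(samples_a, dtype=float))
    b = np.atleast_2d(np.asarray(samples_b, dtype=float))
    grid = np.vstack([a, b])  # evaluation points v
    cdf_a = np.array([(a <= v).all(axis=1).mean() for v in grid])
    cdf_b = np.array([(b <= v).all(axis=1).mean() for v in grid])
    return bool((cdf_a <= cdf_b).all() and (cdf_a < cdf_b).any())

def esr_set(return_samples):
    """Keep every candidate policy whose return distribution is not
    dominated by any other candidate's (a hypothetical pruning step)."""
    return [i for i, zi in enumerate(return_samples)
            if not any(fsd_dominates(zj, zi)
                       for j, zj in enumerate(return_samples) if j != i)]

# Toy bandit: two arms, each return is a vector (objective 1, objective 2).
arm_1 = [(1.0, 1.0), (1.0, 1.0), (2.0, 3.0)]
arm_2 = [(0.0, 1.0), (1.0, 1.0), (1.0, 2.0)]
print(fsd_dominates(arm_1, arm_2))  # True: arm_1 dominates arm_2
print(esr_set([arm_1, arm_2]))      # [0]: only arm_1 survives the pruning
```

In this toy example, arm 1's return distribution places less probability mass on low outcomes in both objectives, so it dominates arm 2 and is the only member of the resulting set.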