Multi-objective reinforcement learning (MORL) algorithms tackle sequential decision problems where agents may have different preferences over (possibly conflicting) reward functions. Such algorithms often learn a set of policies (each optimized for a particular agent preference) that can later be used to solve problems with novel preferences. We introduce a novel algorithm that uses Generalized Policy Improvement (GPI) to define principled, formally-derived prioritization schemes that improve sample-efficient learning. They implement active-learning strategies by which the agent can (i) identify the most promising preferences/objectives to train on at each moment, to more rapidly solve a given MORL problem; and (ii) identify which previous experiences are most relevant when learning a policy for a particular agent preference, via a novel Dyna-style MORL method. We prove our algorithm is guaranteed to always converge to an optimal solution in a finite number of steps, or an $\epsilon$-optimal solution (for a bounded $\epsilon$) if the agent is limited and can only identify possibly sub-optimal policies. We also prove that our method monotonically improves the quality of its partial solutions while learning. Finally, we introduce a bound that characterizes the maximum utility loss (with respect to the optimal solution) incurred by the partial solutions computed by our method throughout learning. We empirically show that our method outperforms state-of-the-art MORL algorithms in challenging multi-objective tasks, both with discrete and continuous state spaces.
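As background for the abstract above, Generalized Policy Improvement constructs a new policy from a set of previously learned policies. The following is a minimal sketch of its standard multi-objective form under linear preferences; the policy set $\Pi$, vector-valued action-value functions $\mathbf{q}^{\pi}$, and preference weight vector $\mathbf{w}$ are illustrative notation, not definitions taken from this abstract:
\[
\pi^{\mathrm{GPI}}(s; \mathbf{w}) \in \arg\max_{a \in \mathcal{A}} \; \max_{\pi \in \Pi} \; \mathbf{w}^{\top} \mathbf{q}^{\pi}(s, a),
\]
which, for the scalarized objective induced by $\mathbf{w}$, is guaranteed to perform at least as well as every policy in $\Pi$, i.e., $v^{\pi^{\mathrm{GPI}}}_{\mathbf{w}}(s) \ge \max_{\pi \in \Pi} v^{\pi}_{\mathbf{w}}(s)$ for all states $s$. The prioritization schemes mentioned in the abstract are described as being formally derived from GPI guarantees of this kind.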