Multi-objective reinforcement learning (MORL) is an extension of ordinary, single-objective reinforcement learning (RL) that applies to many real-world tasks in which multiple objectives exist without known relative costs. We study the problem of single-policy MORL, which learns an optimal policy given a preference over the objectives. Existing methods require strong assumptions, such as exact knowledge of the multi-objective Markov decision process, and are analyzed only in the limit of infinite data and time. We propose a new algorithm called model-based envelope value iteration (EVI), which generalizes the enveloped multi-objective $Q$-learning algorithm of Yang et al. (2019). Our method can learn a near-optimal value function with polynomial sample complexity and a linear convergence rate. To the best of our knowledge, this is the first finite-sample analysis of MORL algorithms.
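As a rough sketch (under assumed notation, not the paper's own exposition), the envelope backup that such a value iteration would repeatedly apply can be written as
\[
(\mathcal{T}\boldsymbol{Q})(s,a,\boldsymbol{\omega}) \;=\; \boldsymbol{r}(s,a) \;+\; \gamma\,\mathbb{E}_{s' \sim \widehat{P}(\cdot \mid s,a)}\big[\boldsymbol{Q}(s',a^{*},\boldsymbol{\omega}^{*})\big],
\qquad
(a^{*},\boldsymbol{\omega}^{*}) \in \operatorname*{arg\,max}_{a',\,\boldsymbol{\omega}'} \boldsymbol{\omega}^{\top}\boldsymbol{Q}(s',a',\boldsymbol{\omega}'),
\]
where $\boldsymbol{Q}$ and $\boldsymbol{r}$ are vector-valued, $\boldsymbol{\omega}$ is the preference vector, the inner maximization over actions and preferences forms the envelope, and $\widehat{P}$ denotes a transition model estimated from samples in the model-based setting.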