Multi-objective reinforcement learning (MORL) extends ordinary, single-objective reinforcement learning (RL) to the many real-world tasks where multiple objectives exist without known relative costs. We study the problem of single-policy MORL, which learns an optimal policy given the preference over objectives. Existing methods require strong assumptions, such as exact knowledge of the multi-objective Markov decision process, and are analyzed only in the limit of infinite data and time. We propose a new algorithm called model-based envelope value iteration (EVI), which generalizes the envelope multi-objective $Q$-learning algorithm of Yang et al. (2019). Our method learns a near-optimal value function with polynomial sample complexity and a linear convergence rate. To the best of our knowledge, this is the first finite-sample analysis of MORL algorithms.
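To illustrate the flavor of the envelope backup the abstract refers to, here is a minimal sketch of an envelope-style value-iteration update on a small tabular multi-objective MDP. It assumes an estimated transition model `P`, vector-valued rewards `R`, and a finite grid of preference vectors `prefs`; all names and shapes are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np

def envelope_value_iteration(P, R, prefs, gamma=0.9, n_iters=100):
    """Illustrative envelope value-iteration backup (hypothetical sketch).

    P:     (S, A, S) estimated transition probabilities
    R:     (S, A, m) vector-valued rewards
    prefs: (W, m)    grid of preference vectors on the simplex
    """
    S, A, m = R.shape
    W = prefs.shape[0]
    # Q[s, a, w] is a vector value in R^m, one per sampled preference.
    Q = np.zeros((S, A, W, m))
    for _ in range(n_iters):
        # Scalarize every candidate (a', w') under each preference w.
        scalar = np.einsum('wm,savm->swav', prefs, Q)        # (S, W, A, W)
        flat = scalar.reshape(S, W, A * W)
        best = flat.argmax(axis=-1)                          # (S, W)
        a_star, w_star = np.unravel_index(best, (A, W))
        # Envelope target: the *vector* value attaining the scalarized max.
        V = Q[np.arange(S)[:, None], a_star, w_star]         # (S, W, m)
        # Model-based Bellman backup using the estimated transition model.
        EV = np.einsum('sap,pwm->sawm', P, V)                 # (S, A, W, m)
        Q = R[:, :, None, :] + gamma * EV
    return Q
```

The key difference from a scalarized backup is that the maximization in the target ranges jointly over actions and preferences, and the backed-up quantity is the full vector value attaining that maximum, which is the envelope idea the algorithm generalizes to the model-based setting.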