In this paper we consider multi-objective reinforcement learning, where the objectives are balanced using preferences. In practice, the preferences are often given in an adversarial manner, e.g., customers can be picky in many applications. We formalize this problem as an episodic learning problem on a Markov decision process, where the transitions are unknown and the reward function is the inner product of a preference vector and pre-specified multi-objective reward functions. We consider two settings. In the online setting, the agent receives an (adversarial) preference every episode and proposes policies to interact with the environment. We provide a model-based algorithm that achieves a nearly minimax optimal regret bound $\widetilde{\mathcal{O}}\bigl(\sqrt{\min\{d,S\}\cdot H^2 SAK}\bigr)$, where $d$ is the number of objectives, $S$ is the number of states, $A$ is the number of actions, $H$ is the length of the horizon, and $K$ is the number of episodes. Furthermore, we consider preference-free exploration, i.e., the agent first interacts with the environment without specifying any preference and is then able to accommodate an arbitrary preference vector up to $\epsilon$ error. Our proposed algorithm is provably efficient with a nearly optimal trajectory complexity $\widetilde{\mathcal{O}}\bigl({\min\{d,S\}\cdot H^3 SA}/{\epsilon^2}\bigr)$. This result partly resolves an open problem raised by \citet{jin2020reward}.
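For concreteness, the scalarization described above can be written as follows; this is a minimal sketch, and the symbols $w$ for the preference vector and $\mathbf{r}_h$ for the multi-objective reward are illustrative notation, not necessarily the paper's.
\[
  % Scalarized reward at step h: inner product of the preference vector
  % with the pre-specified multi-objective reward (notation illustrative).
  r_h^{w}(s,a) \;=\; \bigl\langle w,\, \mathbf{r}_h(s,a) \bigr\rangle
  \;=\; \sum_{i=1}^{d} w_i\, r_{h,i}(s,a),
  \qquad w \in \mathbb{R}^{d},\quad \mathbf{r}_h(s,a) \in \mathbb{R}^{d},
\]
where $s$ is a state, $a$ is an action, $h \in [H]$ indexes the step within an episode, and $d$ is the number of objectives. In the online setting a (possibly adversarial) $w$ arrives each episode; in preference-free exploration the agent must be able to plan near-optimally for any such $w$ after exploration.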