In this paper we consider multi-objective reinforcement learning, where the objectives are balanced using preferences. In practice, the preferences are often given in an adversarial manner, e.g., customers can be picky in many applications. We formalize this problem as an episodic learning problem on a Markov decision process, where the transitions are unknown and the reward function is the inner product of a preference vector with pre-specified multi-objective reward functions. In the online setting, the agent receives an (adversarial) preference vector in each episode and proposes a policy to interact with the environment. We provide a model-based algorithm that achieves a regret bound of $\widetilde{\mathcal{O}}\left(\sqrt{\min\{d,S\}\cdot H^3 SAK}\right)$, where $d$ is the number of objectives, $S$ is the number of states, $A$ is the number of actions, $H$ is the length of the horizon, and $K$ is the number of episodes. Furthermore, we consider preference-free exploration, i.e., the agent first interacts with the environment without specifying any preference and is then able to accommodate arbitrary preference vectors up to an $\epsilon$ error. Our proposed algorithm is provably efficient with a nearly optimal sample complexity of $\widetilde{\mathcal{O}}\left(\frac{\min\{d,S\}\cdot H^4 SA}{\epsilon^2}\right)$.
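To make the reward structure concrete, the following is a minimal sketch (not the paper's algorithm) of how a preference vector scalarizes the multi-objective reward in a tabular episodic MDP and how a planning oracle would act under a given preference. The instance sizes, the random rewards and transitions, and the helper names `scalarized_reward` and `value_iteration` are illustrative assumptions; the transition kernel is treated as known here purely for the planning illustration, whereas in the paper it is unknown and must be learned.

```python
import numpy as np

# Illustrative sketch: a tabular episodic MDP with d-dimensional rewards, where the
# scalar reward is the inner product <w, r(s, a)> for a preference vector w.
S, A, H, d = 5, 3, 4, 2                      # states, actions, horizon, objectives
rng = np.random.default_rng(0)
R = rng.uniform(size=(S, A, d))              # pre-specified multi-objective rewards in [0, 1]^d
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel P(s' | s, a), assumed known here

def scalarized_reward(w, s, a):
    """Scalar reward <w, r(s, a)> for a preference w on the simplex."""
    return float(w @ R[s, a])

def value_iteration(w):
    """Finite-horizon value iteration under preference w (planning oracle only)."""
    V = np.zeros((H + 1, S))
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        Q = R @ w + P @ V[h + 1]             # Q[s, a] = <w, r(s, a)> + sum_{s'} P(s'|s,a) V_{h+1}(s')
        pi[h] = Q.argmax(axis=1)
        V[h] = Q.max(axis=1)
    return V, pi

w = np.array([0.7, 0.3])                     # example (possibly adversarially chosen) preference
V, pi = value_iteration(w)
print("value at step h=0:", V[0])
```

In the online protocol described above, a new preference `w` would arrive at the start of each of the $K$ episodes, and the learner would plan against its current estimate of the unknown transitions rather than the true `P`.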