The aim of Inverse Reinforcement Learning (IRL) is to infer a reward function $R$ from a policy $\pi$. To do this, we need a model of how $\pi$ relates to $R$. In the current literature, the most common models are optimality, Boltzmann rationality, and causal entropy maximisation. One of the primary motivations behind IRL is to infer human preferences from human behaviour. However, the true relationship between human preferences and human behaviour is much more complex than any of the models currently used in IRL. This means that these models are misspecified, which raises the concern that they might lead to unsound inferences if applied to real-world data. In this paper, we provide a mathematical analysis of how robust different IRL models are to misspecification, and characterise precisely how much the demonstrator policy may differ from each of the standard models before that model leads to faulty inferences about the reward function $R$. We also introduce a framework for reasoning about misspecification in IRL, together with formal tools that can be used to easily derive the misspecification robustness of new IRL models.
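For context, the three behavioural models named above are standardly formalised as follows. This is a sketch of the common definitions from the IRL literature, with inverse temperature $\beta > 0$ and entropy weight $\alpha > 0$ as free parameters; the paper's own formalisation may differ in detail:
\[
\begin{aligned}
&\textbf{Optimality:} && \pi(a \mid s) > 0 \implies a \in \operatorname*{arg\,max}_{a'} Q^{*}_{R}(s, a'),\\
&\textbf{Boltzmann rationality:} && \pi(a \mid s) \propto \exp\!\big(\beta\, Q^{*}_{R}(s, a)\big),\\
&\textbf{Maximal causal entropy:} && \pi \in \operatorname*{arg\,max}_{\pi'}\; \mathbb{E}_{\pi'}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\Big(R(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi'(\cdot \mid s_t)\big)\Big)\right],
\end{aligned}
\]
where $Q^{*}_{R}$ is the optimal $Q$-function for $R$ and $\mathcal{H}$ denotes Shannon entropy. Misspecification then refers to the demonstrator policy $\pi$ not exactly satisfying whichever of these models the IRL algorithm assumes.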