The aim of Inverse Reinforcement Learning (IRL) is to infer a reward function $R$ from a policy $\pi$. To do this, we need a model of how $\pi$ relates to $R$. In the current literature, the most common models are optimality, Boltzmann rationality, and causal entropy maximisation. One of the primary motivations behind IRL is to infer human preferences from human behaviour. However, the true relationship between human preferences and human behaviour is much more complex than any of the models currently used in IRL. This means that these models are misspecified, which raises the worry that they might lead to unsound inferences if applied to real-world data. In this paper, we provide a mathematical analysis of how robust different IRL models are to misspecification, and characterise precisely how the demonstrator policy may differ from each of the standard models before that model leads to faulty inferences about the reward function $R$. We also introduce a framework for reasoning about misspecification in IRL, together with formal tools that can be used to easily derive the misspecification robustness of new IRL models.
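For reference, the three behavioural models named above are standardly formulated as follows. This is a sketch using common conventions, with symbols not defined in the abstract itself: $Q^*_R$ denotes the optimal $Q$-function for $R$, $\beta > 0$ an inverse temperature, $\gamma$ a discount factor, $\alpha > 0$ an entropy weight, and $\mathcal{H}$ the Shannon entropy.
\begin{align*}
&\text{Optimality:} && \pi(a \mid s) > 0 \implies a \in \operatorname*{arg\,max}_{a'} Q^*_R(s, a'), \\
&\text{Boltzmann rationality:} && \pi(a \mid s) \propto \exp\!\bigl(\beta\, Q^*_R(s, a)\bigr), \\
&\text{Causal entropy maximisation:} && \pi \in \operatorname*{arg\,max}_{\pi'} \; \mathbb{E}_{\pi'}\!\Bigl[\textstyle\sum_{t=0}^{\infty} \gamma^t \bigl(R(s_t, a_t) + \alpha\, \mathcal{H}(\pi'(\cdot \mid s_t))\bigr)\Bigr].
\end{align*}
In each case, the model specifies which policies $\pi$ are considered compatible with a given reward function $R$, and misspecification means the demonstrator's actual behaviour falls outside this set.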