Inverse reinforcement learning attempts to reconstruct the reward function in a Markov decision problem, using observations of agent actions. As already observed by Russell, the problem is ill-posed: the reward function is not identifiable, even in the presence of perfect information about optimal behavior. We provide a resolution to this non-identifiability for problems with entropy regularization. For a given environment, we fully characterize the reward functions leading to a given policy and demonstrate that, given demonstrations of actions for the same reward under two distinct discount factors, or under sufficiently different environments, the unobserved reward can be recovered up to a constant. Through a simple numerical experiment, we demonstrate accurate reconstruction of the reward function using the proposed resolution.
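As a minimal sketch of the kind of characterization referred to above, consider the standard entropy-regularized (soft) Bellman equations; the notation and the unit regularization weight here are our assumptions and not necessarily the paper's exact formulation. For transition kernel $P$ and discount $\gamma$, the optimal policy satisfies
\[
  \pi(a \mid s) = \exp\bigl(Q(s,a) - V(s)\bigr), \qquad
  V(s) = \log \sum_{a'} \exp Q(s,a'), \qquad
  Q(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\bigl[V(s')\bigr],
\]
so eliminating $Q$ gives
\[
  r(s,a) = \log \pi(a \mid s) + V(s) - \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\bigl[V(s')\bigr].
\]
Under these assumptions, any two rewards consistent with the same policy differ by a shaping term of the form $\varphi(s) - \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}[\varphi(s')]$ for some state-only function $\varphi$, which is why demonstrations under a second discount factor (or a sufficiently different environment) can pin the reward down up to an additive constant.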