Inverse reinforcement learning attempts to reconstruct the reward function in a Markov decision problem, using observations of agent actions. As already observed by Russell [1998], the problem is ill-posed: the reward function is not identifiable, even in the presence of perfect information about optimal behavior. We provide a resolution to this non-identifiability for problems with entropy regularization. For a given environment, we fully characterize the reward functions leading to a given policy and demonstrate that, given demonstrations of actions for the same reward under two distinct discount factors, or under sufficiently different environments, the unobserved reward can be recovered up to a constant. We also give general necessary and sufficient conditions for the reconstruction of time-homogeneous rewards on finite horizons, as well as for action-independent rewards, generalizing recent results of Kim et al. [2021] and Fu et al. [2018].
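As a brief, informal illustration of the entropy-regularized setting (the notation $Q_r$, $V_r$, $\pi_r$, $\Phi$ is shorthand introduced here; precise statements are given in the body of the paper), the soft Bellman equations
\[
Q_r(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\big[V_r(s')\big],
\qquad
V_r(s) = \log \sum_{a} \exp Q_r(s,a),
\qquad
\pi_r(a \mid s) = \exp\big(Q_r(s,a) - V_r(s)\big),
\]
show why the policy alone cannot identify the reward: if $r'(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s'}[\Phi(s')] - \Phi(s)$ for some potential $\Phi$, then $Q_{r'} = Q_r - \Phi$ and $V_{r'} = V_r - \Phi$, so $\pi_{r'} = \pi_r$. Observing the same agent under a second discount factor changes the shaping term, which is what allows the reward to be pinned down up to an additive constant.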