How well do reward functions learned with inverse reinforcement learning (IRL) generalize? We illustrate that state-of-the-art IRL algorithms, which maximize a maximum-entropy objective, learn rewards that overfit to the demonstrations. Such rewards struggle to provide meaningful signals for states not covered by the demonstrations, a major detriment when the reward is used to learn policies in new situations. We introduce BC-IRL, a new inverse reinforcement learning method that learns reward functions which generalize better than those of maximum-entropy IRL approaches. In contrast to the MaxEnt framework, which learns to maximize rewards around demonstrations, BC-IRL updates the reward parameters such that the policy trained with the new reward better matches the expert demonstrations. We show that BC-IRL learns rewards that generalize better on a simple illustrative task and on two continuous robotic control tasks, achieving over twice the success rate of baselines in challenging generalization settings.
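Since the abstract describes BC-IRL's update rule only at a high level, the sketch below illustrates the bi-level idea in PyTorch: the reward parameters are trained by backpropagating a behavioral-cloning loss on expert data through one differentiable inner policy update taken under the current learned reward. This is a minimal, hypothetical sketch, not the authors' implementation; the network sizes, the reward-weighted-regression inner step (standing in for the policy-gradient update used in practice), and all names are illustrative assumptions.

```python
# Minimal sketch of the BC-IRL idea (toy setup, not the authors' code).
import torch
import torch.nn as nn

obs_dim, act_dim = 4, 2  # assumed toy dimensions

reward_net = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, 1))
policy_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
reward_opt = torch.optim.Adam(reward_net.parameters(), lr=1e-3)
inner_lr = 1e-2


def bc_irl_update(rollout_obs, rollout_act, expert_obs, expert_act):
    """One outer reward update from a batch of policy rollouts and expert data."""
    # Inner step: improve the policy under the *learned* reward. A simple
    # reward-weighted regression stands in for the policy-gradient step here.
    rewards = reward_net(torch.cat([rollout_obs, rollout_act], dim=-1)).squeeze(-1)
    weights = torch.softmax(rewards, dim=0)
    inner_loss = (weights * ((policy_net(rollout_obs) - rollout_act) ** 2).sum(-1)).sum()

    # Differentiable inner update: "fast" policy weights are a function of the reward,
    # so gradients can later flow from the BC loss back into the reward parameters.
    grads = torch.autograd.grad(inner_loss, list(policy_net.parameters()), create_graph=True)
    fast_params = [p - inner_lr * g for p, g in zip(policy_net.parameters(), grads)]

    # Evaluate the updated policy on expert states (manual forward with fast weights).
    h = torch.tanh(expert_obs @ fast_params[0].t() + fast_params[1])
    pred_act = h @ fast_params[2].t() + fast_params[3]

    # Outer step: behavioral-cloning loss of the *updated* policy on expert actions;
    # its gradient flows through the inner step into the reward network.
    bc_loss = ((pred_act - expert_act) ** 2).mean()
    reward_opt.zero_grad()
    bc_loss.backward()
    reward_opt.step()
    # In the full algorithm the policy itself is also trained on the learned reward;
    # that separate policy update is omitted here for brevity.
    return bc_loss.item()


# Toy usage with random tensors standing in for rollout and expert batches.
B = 32
loss = bc_irl_update(torch.randn(B, obs_dim), torch.randn(B, act_dim),
                     torch.randn(B, obs_dim), torch.randn(B, act_dim))
```

The key contrast with MaxEnt IRL is visible in the outer loss: the reward is judged by how well the policy it induces imitates the expert, rather than by how much reward it assigns to the demonstrations themselves.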