Inverse Reinforcement Learning (IRL) is attractive in scenarios where reward engineering can be tedious. However, prior IRL algorithms use on-policy transitions, which require intensive sampling from the current policy for stable and optimal performance. This limits IRL applications in the real world, where environment interactions can become highly expensive. To tackle this problem, we present Off-Policy Inverse Reinforcement Learning (OPIRL), which (1) adopts an off-policy data distribution instead of an on-policy one, significantly reducing the number of interactions with the environment, (2) learns a stationary reward function that is transferable and generalizes well under changing dynamics, and (3) leverages mode-covering behavior for faster convergence. Through experiments, we demonstrate that our method is considerably more sample-efficient and generalizes to novel environments. It achieves policy performance that is better than or comparable to baselines with significantly fewer interactions. Furthermore, we empirically show that the recovered reward function generalizes to different tasks where prior methods are prone to fail.
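To make the on-policy versus off-policy distinction in point (1) concrete, the sketch below (not the authors' implementation; `ReplayBuffer` and `Transition` are illustrative names) shows the standard off-policy mechanism of storing all past transitions in a replay buffer and reusing them for many updates, instead of collecting a fresh on-policy batch for every update.

```python
# Minimal sketch, assuming a generic replay-buffer setup: off-policy training
# reuses transitions gathered by any past policy, which is what reduces the
# number of environment interactions. Names here are illustrative assumptions.
import random
from collections import deque, namedtuple

Transition = namedtuple("Transition", ["state", "action", "reward", "next_state", "done"])

class ReplayBuffer:
    """Stores transitions from any past policy for off-policy reuse."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, *transition):
        self.buffer.append(Transition(*transition))

    def sample(self, batch_size):
        # Off-policy: minibatches are drawn from the whole interaction history,
        # so each environment step can be reused across many gradient updates.
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

# An on-policy method would instead collect a new batch with the current policy
# for every update and then discard it, which drives up the interaction cost
# that OPIRL aims to avoid.
```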