We present an algorithm for Inverse Reinforcement Learning (IRL) from expert state observations only. Our approach decouples reward modelling from policy learning, unlike state-of-the-art adversarial methods, which must update the reward model during policy search and are known to be unstable and difficult to optimize. Our method, IL-flOw, recovers the expert policy by modelling state-to-state transitions: rewards are generated by deep density estimators trained on the demonstration trajectories, which avoids the instability issues of adversarial methods. We show that using the state-transition log-probability density as a reward signal for forward reinforcement learning amounts to matching the trajectory distribution of the expert demonstrations, and we experimentally demonstrate good recovery of the true reward signal as well as state-of-the-art results for imitation from observation on locomotion and robotic continuous control tasks.
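As a rough illustration of the reward construction described above, the sketch below fits a density model on expert (s_t, s_{t+1}) pairs and uses the transition log-density as the reward handed to a standard forward RL algorithm. It is only a minimal sketch: it substitutes scikit-learn's KernelDensity for the deep density estimator (normalizing flow) trained in IL-flOw, and the function names and array shapes are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.neighbors import KernelDensity


def fit_transition_density(expert_trajectories, bandwidth=0.2):
    """Fit a density model on expert state-transition pairs.

    expert_trajectories: list of arrays of shape (T, state_dim), states only.
    IL-flOw trains a deep density estimator (normalizing flow); a kernel
    density estimator is used here purely as a simple stand-in.
    """
    pairs = np.concatenate(
        [np.concatenate([traj[:-1], traj[1:]], axis=1) for traj in expert_trajectories],
        axis=0,
    )
    return KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(pairs)


def transition_reward(density_model, state, next_state):
    """Reward for forward RL: log-probability of the observed transition
    (state, next_state) under the expert transition density."""
    pair = np.concatenate([state, next_state]).reshape(1, -1)
    return float(density_model.score_samples(pair)[0])
```

In this sketch the reward model is fixed once it is fit on the demonstrations, so the downstream policy optimizer treats transition_reward as an ordinary, stationary reward function, which is the decoupling from policy learning emphasized above.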