The goal of imitation learning is to mimic expert behavior from demonstrations, without access to an explicit reward signal. A popular class of approaches infers the (unknown) reward function via inverse reinforcement learning (IRL) and then maximizes this reward function via reinforcement learning (RL). However, the policies learned by these approaches are brittle in practice and deteriorate quickly even under small test-time perturbations due to compounding errors. We propose Imitation with Planning at Test-time (IMPLANT), a new meta-algorithm for imitation learning that uses decision-time planning to correct the compounding errors of any base imitation policy. In contrast to existing approaches, we retain both the imitation policy and the reward model at decision time, thereby benefiting from the learning signals of both components. Empirically, we demonstrate that IMPLANT significantly outperforms benchmark imitation learning approaches on standard control environments and excels at zero-shot generalization under challenging perturbations to the test-time dynamics.
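To make the core idea concrete, the sketch below illustrates one way decision-time planning can combine a base imitation policy with a learned (IRL) reward model: candidate action sequences are sampled near the imitation policy and scored by the reward model under a dynamics model, and the first action of the best sequence is executed. This is a minimal random-shooting planner written for illustration only; the function names (`policy`, `reward_model`, `dynamics_model`) and the specific planning scheme are assumptions, not the exact algorithm used by IMPLANT.

```python
import numpy as np

def plan_action(state, policy, reward_model, dynamics_model,
                horizon=10, num_candidates=64, noise_std=0.1):
    """Illustrative decision-time planner (random-shooting MPC).

    Samples action sequences around the imitation policy, scores them
    with the learned reward model rolled out through a dynamics model,
    and returns the first action of the highest-return sequence.
    """
    best_return, best_first_action = -np.inf, None
    for _ in range(num_candidates):
        s, total_return, first_action = state, 0.0, None
        for t in range(horizon):
            # Perturb the imitation policy's action to stay close to expert behavior
            # while still exploring corrections for compounding errors.
            a = policy(s) + noise_std * np.random.randn(*np.shape(policy(s)))
            if t == 0:
                first_action = a
            total_return += reward_model(s, a)   # learned reward (e.g., from IRL)
            s = dynamics_model(s, a)             # simulated or learned dynamics
        if total_return > best_return:
            best_return, best_first_action = total_return, first_action
    return best_first_action
```

At test time, such a planner would be queried at every step with the current state, so the reward model and dynamics rollouts continually correct deviations that the imitation policy alone would accumulate.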