Adversarial Imitation Learning alternates between learning a discriminator -- which distinguishes expert demonstrations from generated ones -- and a generator policy trained to produce trajectories that fool this discriminator. This alternating optimization is known to be delicate in practice, since it compounds unstable adversarial training with brittle and sample-inefficient reinforcement learning. We propose to remove the burden of the policy optimization steps by leveraging a novel discriminator formulation. Specifically, our discriminator is explicitly conditioned on two policies: the policy from the previous generator iteration and a learnable policy. When optimized, this discriminator directly learns the optimal generator policy. Consequently, our discriminator's update solves the generator's optimization problem for free: learning a policy that imitates the expert does not require an additional optimization loop. This formulation effectively halves the implementation and computational burden of Adversarial Imitation Learning algorithms by removing the reinforcement learning phase altogether. We show on a variety of tasks that our simpler approach is competitive with prevalent Imitation Learning methods.
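To make the two-policy conditioning concrete, below is a minimal PyTorch-style sketch. It assumes (this is our illustrative assumption, not necessarily the paper's exact formulation) a structured discriminator of the ratio form D(s, a) = pi_theta(a|s) / (pi_theta(a|s) + pi_prev(a|s)), with pi_prev the frozen policy from the previous generator iteration and pi_theta the learnable policy; names such as PolicyNet, structured_discriminator_logit, and discriminator_update are hypothetical, and discrete actions are assumed for simplicity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical policy network over discrete actions.
class PolicyNet(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def log_prob(self, obs, act):
        logits = self.net(obs)
        return torch.distributions.Categorical(logits=logits).log_prob(act)


def structured_discriminator_logit(learner, previous, obs, act):
    """Discriminator conditioned on two policies (assumed ratio form).

    D(s, a) = pi_learner(a|s) / (pi_learner(a|s) + pi_prev(a|s)),
    returned as a logit: log pi_learner(a|s) - log pi_prev(a|s).
    The previous policy is treated as fixed, so its term is detached.
    """
    return learner.log_prob(obs, act) - previous.log_prob(obs, act).detach()


def discriminator_update(learner, previous, optimizer,
                         expert_obs, expert_act, gen_obs, gen_act):
    """One binary-classification step: expert transitions labelled 1,
    transitions generated by the previous policy labelled 0.
    Because the discriminator is parameterized by the learner policy,
    this single update also trains the imitation policy -- there is no
    separate reinforcement-learning step."""
    expert_logit = structured_discriminator_logit(learner, previous, expert_obs, expert_act)
    gen_logit = structured_discriminator_logit(learner, previous, gen_obs, gen_act)
    loss = (F.binary_cross_entropy_with_logits(expert_logit, torch.ones_like(expert_logit))
            + F.binary_cross_entropy_with_logits(gen_logit, torch.zeros_like(gen_logit)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch, the outer loop would alternate only two cheap steps: roll out the previous policy to collect generated transitions, then run discriminator_update (e.g. with torch.optim.Adam over learner.parameters()); the learner policy obtained after the update is used directly as the next generator, which is what removes the explicit policy optimization loop.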