This paper considers learning robot locomotion and manipulation tasks from expert demonstrations. Generative adversarial imitation learning (GAIL) trains a discriminator that distinguishes expert from agent transitions, and in turn uses a reward defined by the discriminator output to optimize a policy generator for the agent. This generative adversarial training approach is very powerful but depends on a delicate balance between the discriminator and the generator training. In high-dimensional problems, the discriminator may easily overfit or exploit associations with task-irrelevant features to classify transitions. A key insight of this work is that performing imitation learning in a suitable latent task space makes the training process stable, even in challenging high-dimensional problems. We use an action encoder-decoder model to obtain a low-dimensional latent action space and train a LAtent Policy using Adversarial imitation Learning (LAPAL). The encoder-decoder model can be trained offline from state-action pairs to obtain a task-agnostic latent action representation, or online, simultaneously with the discriminator and generator training, to obtain a task-aware latent action representation. We demonstrate that LAPAL training is stable, with near-monotonic performance improvement, and achieves expert performance in most locomotion and manipulation tasks, while a GAIL baseline converges more slowly and does not achieve expert performance in high-dimensional environments.
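To make the latent-action idea in the abstract concrete, the following is a minimal sketch, assuming a PyTorch setup; the network sizes, the state-conditioned encoder/decoder, the latent dimension, and the reward form -log(1 - D) are illustrative assumptions rather than the paper's exact design.

```python
# Minimal sketch of a latent-action imitation setup: an action autoencoder
# gives a low-dimensional latent action space, and a GAIL-style discriminator
# over (state, latent action) pairs defines the reward for a latent policy.
# All architectural choices here are assumptions for illustration only.
import torch
import torch.nn as nn


def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))


class ActionAutoencoder(nn.Module):
    """Maps raw actions to a low-dimensional latent action space and back."""
    def __init__(self, state_dim, action_dim, latent_dim):
        super().__init__()
        # Conditioning the encoder/decoder on the state is one plausible choice.
        self.encoder = mlp(state_dim + action_dim, latent_dim)
        self.decoder = mlp(state_dim + latent_dim, action_dim)

    def encode(self, s, a):
        return self.encoder(torch.cat([s, a], dim=-1))

    def decode(self, s, z):
        return self.decoder(torch.cat([s, z], dim=-1))


class Discriminator(nn.Module):
    """Classifies expert vs. agent transitions; its output defines the reward."""
    def __init__(self, state_dim, latent_dim):
        super().__init__()
        self.net = mlp(state_dim + latent_dim, 1)

    def reward(self, s, z):
        # GAIL-style reward from the discriminator output: -log(1 - D(s, z)).
        d = torch.sigmoid(self.net(torch.cat([s, z], dim=-1)))
        return -torch.log(1.0 - d + 1e-8)


# The latent policy pi(z | s) is then optimized with any off-the-shelf RL
# algorithm against Discriminator.reward, and decoded actions
# a = ActionAutoencoder.decode(s, z) are executed in the environment.
```

The autoencoder can be trained offline on expert state-action pairs (task-agnostic latent actions) or jointly with the discriminator and policy (task-aware latent actions), matching the two training regimes described in the abstract.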