Imitation learning holds tremendous promise for efficiently learning policies for complex decision-making problems. Current state-of-the-art algorithms often use inverse reinforcement learning (IRL), where, given a set of expert demonstrations, an agent alternately infers a reward function and the associated optimal policy. However, such IRL approaches often require substantial online interaction for complex control problems. In this work, we present Regularized Optimal Transport (ROT), a new imitation learning algorithm that builds on recent advances in optimal transport-based trajectory matching. Our key technical insight is that adaptively combining trajectory-matching rewards with behavior cloning can significantly accelerate imitation, even with only a few demonstrations. Our experiments on 20 visual control tasks across the DeepMind Control Suite, the OpenAI Robotics Suite, and the Meta-World Benchmark demonstrate an average of 7.8X faster imitation to reach 90% of expert performance compared to prior state-of-the-art methods. On real-world robotic manipulation, with just one demonstration and an hour of online training, ROT achieves an average success rate of 90.1% across 14 tasks.
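To make the "adaptive combination" concrete, the sketch below illustrates one way an actor loss could mix an RL term driven by a trajectory-matching reward with a behavior-cloning regularizer whose weight is adapted online. This is a minimal illustration under assumed interfaces (`policy`, `critic`, and the passed-in `bc_weight` schedule are hypothetical), not the paper's exact formulation.

```python
# Minimal sketch: adaptively combining an RL objective (whose rewards would come from
# optimal-transport trajectory matching) with a behavior-cloning (BC) regularizer.
# `policy`, `critic`, and the weighting rule are illustrative assumptions.
import torch
import torch.nn.functional as F

def adaptive_imitation_actor_loss(policy, critic, obs, expert_obs, expert_act, bc_weight):
    """Combine a Q-maximizing RL term with a BC term on expert data.

    bc_weight is assumed to be adapted during training (e.g., annealed, or gated by
    how the policy's estimated value compares to the demonstrator's); here it is
    simply passed in by the caller.
    """
    # RL term: maximize the critic's value of the policy's own actions
    # (the critic is assumed to be trained on trajectory-matching rewards).
    rl_loss = -critic(obs, policy(obs)).mean()

    # BC term: regress the policy toward expert actions on expert observations.
    bc_loss = F.mse_loss(policy(expert_obs), expert_act)

    # Adaptive combination: BC dominates early in training and fades as the
    # reward-driven RL term takes over.
    return (1.0 - bc_weight) * rl_loss + bc_weight * bc_loss
```

The design intent mirrored here is that BC supervision anchors the policy when online experience is scarce (e.g., with a single demonstration), while the trajectory-matching reward progressively refines behavior as more interaction data accumulates.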