Most existing imitation learning approaches assume the demonstrations are drawn from experts who are optimal, but relaxing this assumption enables us to use a wider range of data. Standard imitation learning may learn a suboptimal policy from demonstrations with varying optimality. Prior works use confidence scores or rankings to capture beneficial information from demonstrations with varying optimality, but they suffer from limitations, e.g., requiring manually annotated confidence scores or a high average optimality of the demonstrations. In this paper, we propose a general framework for learning from demonstrations with varying optimality that jointly learns the confidence score and a well-performing policy. Our approach, Confidence-Aware Imitation Learning (CAIL), learns a well-performing policy from confidence-reweighted demonstrations, while using an outer loss to track the performance of our model and to learn the confidence. We provide theoretical guarantees on the convergence of CAIL and evaluate its performance in both simulated and real robot experiments. Our results show that CAIL significantly outperforms other imitation learning methods when learning from demonstrations with varying optimality. We further show that even without access to any optimal demonstrations, CAIL can still learn a successful policy and outperforms prior work.
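To make the bi-level structure concrete, the following is a minimal sketch of confidence-reweighted imitation learning with a one-step lookahead outer update, written in PyTorch. This is not the authors' implementation: the linear policy, the synthetic demonstrations, the small trusted set used by the outer loss, and all names (demo_states, conf_logits, inner_lr, ...) are illustrative assumptions made here for brevity; CAIL itself is formulated more generally.

```python
# Sketch of confidence-aware imitation learning as bi-level optimization.
# Assumptions (not from the paper's code): a linear policy, synthetic demos,
# mean-squared behavioral-cloning loss, and a one-step lookahead outer update.
import torch

torch.manual_seed(0)
state_dim, action_dim, n_demos = 4, 2, 50

# Synthetic demonstrations of varying optimality: the first half mimic a good
# expert, the second half are corrupted with noise (suboptimal demonstrators).
true_W = torch.randn(state_dim, action_dim)
demo_states = torch.randn(n_demos, state_dim)
demo_actions = demo_states @ true_W
demo_actions[n_demos // 2:] += 2.0 * torch.randn(n_demos // 2, action_dim)

# A small trusted set used only by the outer loss.
outer_states = torch.randn(10, state_dim)
outer_actions = outer_states @ true_W

policy_W = torch.zeros(state_dim, action_dim, requires_grad=True)  # linear policy
conf_logits = torch.zeros(n_demos, requires_grad=True)             # one confidence per demo

policy_opt = torch.optim.SGD([policy_W], lr=0.05)
conf_opt = torch.optim.Adam([conf_logits], lr=0.1)
inner_lr = 0.05

for step in range(500):
    conf = torch.sigmoid(conf_logits)

    # Inner loss: confidence-reweighted behavioral cloning.
    per_demo_loss = ((demo_states @ policy_W - demo_actions) ** 2).mean(dim=1)
    inner_loss = (conf * per_demo_loss).mean()

    # One-step lookahead of the policy parameters, kept differentiable so the
    # outer loss can send gradients back into the confidence scores.
    grad_W, = torch.autograd.grad(inner_loss, policy_W, create_graph=True)
    lookahead_W = policy_W - inner_lr * grad_W

    # Outer loss: performance of the updated policy on the trusted set.
    outer_loss = ((outer_states @ lookahead_W - outer_actions) ** 2).mean()
    conf_opt.zero_grad()
    outer_loss.backward()
    conf_opt.step()

    # Ordinary policy update on the (re)weighted inner loss.
    policy_opt.zero_grad()
    conf = torch.sigmoid(conf_logits).detach()
    per_demo_loss = ((demo_states @ policy_W - demo_actions) ** 2).mean(dim=1)
    (conf * per_demo_loss).mean().backward()
    policy_opt.step()

print("mean confidence, clean demos:", torch.sigmoid(conf_logits)[: n_demos // 2].mean().item())
print("mean confidence, noisy demos:", torch.sigmoid(conf_logits)[n_demos // 2:].mean().item())
```

In this toy setting the confidence scores of the noisy demonstrations are driven down while those of the clean demonstrations stay high, which is the qualitative behavior the abstract describes; the paper's method additionally provides convergence guarantees for the joint update.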