Visual imitation learning provides an effective framework for learning skills from demonstrations. However, the quality of the provided demonstrations strongly affects an agent's ability to acquire the desired skills. Standard visual imitation learning therefore assumes near-optimal demonstrations, which are expensive or sometimes prohibitive to collect. Prior work has proposed learning from noisy demonstrations; however, the noise is usually assumed to follow a context-independent distribution such as a uniform or Gaussian distribution. In this paper, we consider another crucial yet underexplored setting -- imitation learning with task-irrelevant yet locally consistent segments in the demonstrations (e.g., wiping sweat while cutting potatoes in a cooking tutorial). We argue that such noise is common in real-world data and term these segments "extraneous." To tackle this problem, we introduce Extraneousness-Aware Imitation Learning (EIL), a self-supervised approach that learns visuomotor policies from third-person demonstrations containing extraneous subsequences. EIL learns action-conditioned observation embeddings in a self-supervised manner and retrieves task-relevant observations across visual demonstrations while excluding the extraneous ones. Experimental results show that EIL outperforms strong baselines and achieves policies comparable to those trained with perfect demonstrations on both simulated and real-world robot control tasks. The project page can be found at https://sites.google.com/view/eil-website.
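For intuition only, below is a minimal sketch of the retrieval step described above, assuming per-frame embeddings have already been produced by a self-supervised, action-conditioned encoder. The function name, the cosine-similarity criterion, and the threshold value are illustrative assumptions rather than the paper's actual algorithm.

```python
# Minimal sketch (not the authors' implementation): one plausible way to
# retrieve task-relevant observations and exclude extraneous ones, given
# per-frame embeddings from a hypothetical action-conditioned encoder.
import numpy as np


def filter_extraneous(demo_emb: np.ndarray,
                      reference_emb: np.ndarray,
                      sim_threshold: float = 0.8) -> np.ndarray:
    """Return indices of frames in demo_emb (T, D) whose best cosine
    similarity to any frame of reference_emb (T_ref, D) meets the
    threshold; low-similarity frames are treated as extraneous."""
    def normalize(x: np.ndarray) -> np.ndarray:
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

    sims = normalize(demo_emb) @ normalize(reference_emb).T  # (T, T_ref)
    best_sim = sims.max(axis=1)                              # (T,)
    return np.flatnonzero(best_sim >= sim_threshold)


# Toy usage: a 10-frame demonstration, an 8-frame reference, 16-dim embeddings.
rng = np.random.default_rng(0)
demo = rng.normal(size=(10, 16))
ref = rng.normal(size=(8, 16))
kept = filter_extraneous(demo, ref, sim_threshold=0.2)
print("task-relevant frame indices:", kept)
```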