In this paper, we analyze the behavior of existing techniques and design new solutions for the problem of one-shot visual imitation. In this setting, an agent must solve a novel instance of a novel task given just a single visual demonstration. Our analysis reveals that current methods fall short because of three errors: the DAgger problem arising from purely offline training, last-centimeter errors in interacting with objects, and mis-fitting to the task context rather than to the actual task. This motivates the design of our modular approach, in which we a) separate task inference (what to do) from task execution (how to do it), and b) develop data augmentation and generation techniques to mitigate mis-fitting. The former allows us to leverage hand-crafted motor primitives for task execution, which sidesteps the DAgger problem and last-centimeter errors, while the latter gets the model to focus on the task rather than the task context. Our model achieves 100% and 48% success rates on two recent benchmarks, improving upon the current state of the art by an absolute 90% and 20%, respectively.