The field of visual representation learning has seen explosive growth in recent years, but its benefits in robotics have been surprisingly limited so far. Prior work uses generic visual representations as a basis to learn (task-specific) robot action policies (e.g. via behavior cloning). While the visual representations do accelerate learning, they are primarily used to encode visual observations. Thus, action information has to be derived purely from robot data, which is expensive to collect! In this work, we present a scalable alternative in which the visual representations help directly infer robot actions. We observe that vision encoders express relationships between image observations as distances (e.g. via embedding dot product) that can be used to efficiently plan robot behavior. We operationalize this insight and develop a simple algorithm for acquiring a distance function and dynamics predictor by fine-tuning a pre-trained representation on human-collected video sequences. The final method substantially outperforms traditional robot learning baselines (e.g. 70% success vs. 50% for behavior cloning on pick-place) on a suite of diverse real-world manipulation tasks. It can also generalize to novel objects, without using any robot demonstrations at training time. For visualizations of the learned policies, please see: https://agi-labs.github.io/manipulate-by-seeing/
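The abstract describes selecting actions by combining a learned distance function (embedding dot product) with a dynamics predictor in embedding space. Below is a minimal, hedged sketch of such a control loop; the functions `embed`, `predict_next`, and the toy action candidates are placeholders invented for illustration and are not the paper's actual trained models.

```python
import numpy as np

# Hypothetical stand-ins for the learned components described in the abstract:
# an embedding function phi(.), a one-step dynamics predictor f(z, a) -> z',
# and a distance given by the negative dot product of normalized embeddings.

def embed(image: np.ndarray) -> np.ndarray:
    """Placeholder vision encoder: flatten and L2-normalize the image."""
    z = image.reshape(-1).astype(np.float64)
    return z / (np.linalg.norm(z) + 1e-8)

def predict_next(z: np.ndarray, action: np.ndarray) -> np.ndarray:
    """Placeholder dynamics predictor in embedding space (toy linear update)."""
    z_next = z + 0.1 * np.resize(action, z.shape)
    return z_next / (np.linalg.norm(z_next) + 1e-8)

def distance(z: np.ndarray, z_goal: np.ndarray) -> float:
    """Distance as negative embedding dot product (smaller = closer to goal)."""
    return -float(np.dot(z, z_goal))

def greedy_action(obs: np.ndarray, goal: np.ndarray,
                  candidate_actions: np.ndarray) -> np.ndarray:
    """Pick the candidate whose predicted next embedding is closest to the goal."""
    z, z_goal = embed(obs), embed(goal)
    scores = [distance(predict_next(z, a), z_goal) for a in candidate_actions]
    return candidate_actions[int(np.argmin(scores))]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    obs = rng.random((8, 8, 3))             # current camera image (toy size)
    goal = rng.random((8, 8, 3))            # goal image
    candidates = rng.normal(size=(64, 4))   # sampled end-effector deltas (toy)
    print("selected action:", greedy_action(obs, goal, candidates))
```

In practice the encoder, dynamics predictor, and candidate action set would come from the fine-tuned representation and the robot's action space; the greedy one-step search here is only one possible way to use the learned distance for planning.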