CLIPort: What and Where Pathways for Robotic Manipulation

How can we imbue robots with the ability to manipulate objects precisely but also to reason about them in terms of abstract concepts? Recent work in manipulation has shown that end-to-end networks can learn dexterous skills that require precise spatial reasoning, but these methods often fail to generalize to new goals or to quickly learn transferable concepts across tasks. In parallel, there has been great progress in learning generalizable semantic representations for vision and language by training on large-scale internet data; however, these representations lack the spatial understanding necessary for fine-grained manipulation. To this end, we propose a framework that combines the best of both worlds: a two-stream architecture with semantic and spatial pathways for vision-based manipulation. Specifically, we present CLIPort, a language-conditioned imitation-learning agent that combines the broad semantic understanding (what) of CLIP [1] with the spatial precision (where) of Transporter [2]. Our end-to-end framework is capable of solving a variety of language-specified tabletop tasks, from packing unseen objects to folding cloths, all without any explicit representations of object poses, instance segmentations, memory, symbolic states, or syntactic structures. Experiments in simulated and real-world settings show that our approach is data efficient in few-shot settings and generalizes effectively to seen and unseen semantic concepts. We even learn one multi-task policy for 10 simulated and 9 real-world tasks that is better than or comparable to single-task policies.
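To make the "what" and "where" split concrete, the following is a toy sketch of a two-stream pick policy. It is not the CLIPort implementation: the per-pixel color features, the height threshold, the fusion by element-wise multiplication, and every name in the code (`semantic_stream`, `spatial_stream`, `fuse`, `pick_location`, `FEATURES`, `DEPTH`) are assumptions made purely for illustration. A real system would use learned CLIP and Transporter features rather than hand-written scores.

```python
# Toy two-stream sketch. All names, features, and numbers are hypothetical.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def semantic_stream(pixel_features, lang_embedding):
    """'What' stand-in: score each pixel by similarity to the instruction."""
    return [[dot(feat, lang_embedding) for feat in row] for row in pixel_features]

def spatial_stream(depth):
    """'Where' stand-in: mark pixels that rise above the table plane."""
    return [[1.0 if d > 0.0 else 0.0 for d in row] for row in depth]

def fuse(semantic, spatial):
    """Combine the two maps element-wise (multiplication is an assumption)."""
    return [[s * w for s, w in zip(s_row, w_row)]
            for s_row, w_row in zip(semantic, spatial)]

def argmax_2d(heatmap):
    """Return (row, col) of the largest value; the first one wins on ties."""
    best_value, best_pos = float("-inf"), None
    for r, row in enumerate(heatmap):
        for c, value in enumerate(row):
            if value > best_value:
                best_value, best_pos = value, (r, c)
    return best_pos

def pick_location(pixel_features, depth, lang_embedding):
    """Fuse both streams into one affordance map and pick its best pixel."""
    fused = fuse(semantic_stream(pixel_features, lang_embedding),
                 spatial_stream(depth))
    return argmax_2d(fused)

# 3x3 scene; each pixel feature is a made-up [redness, blueness] pair.
FEATURES = [
    [[0.0, 0.0], [0.0, 0.0], [0.9, 0.1]],  # red block at (0, 2)
    [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]],  # flat red mark on the table at (1, 1)
    [[0.1, 0.9], [0.0, 0.0], [0.0, 0.0]],  # blue block at (2, 0)
]
DEPTH = [
    [0.0, 0.0, 0.05],
    [0.0, 0.0, 0.0],
    [0.05, 0.0, 0.0],
]

RED = [1.0, 0.0]   # stand-in embedding for "pick the red block"
BLUE = [0.0, 1.0]  # stand-in embedding for "pick the blue block"

red_pick = pick_location(FEATURES, DEPTH, RED)
blue_pick = pick_location(FEATURES, DEPTH, BLUE)
semantic_only = argmax_2d(semantic_stream(FEATURES, RED))
```

In this toy scene, the semantic map alone prefers the flat red mark at (1, 1) because it is the "reddest" pixel. Only after fusion with the spatial map does the policy select the graspable red block at (0, 2), which illustrates why neither pathway is sufficient on its own.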
