Transformers have revolutionized vision and natural language processing with their ability to scale with large datasets. But in robotic manipulation, data is both limited and expensive. Can manipulation still benefit from Transformers with the right problem formulation? We investigate this question with PerAct, a language-conditioned behavior-cloning agent for multi-task 6-DoF manipulation. PerAct encodes language goals and RGB-D voxel observations with a Perceiver Transformer, and outputs discretized actions by ``detecting the next best voxel action''. Unlike frameworks that operate on 2D images, the voxelized 3D observation and action space provides a strong structural prior for efficiently learning 6-DoF actions. With this formulation, we train a single multi-task Transformer for 18 RLBench tasks (with 249 variations) and 7 real-world tasks (with 18 variations) from just a few demonstrations per task. Our results show that PerAct significantly outperforms unstructured image-to-action agents and 3D ConvNet baselines for a wide range of tabletop tasks.
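To make the action formulation concrete, below is a minimal sketch (not the paper's released code) of how a continuous 6-DoF gripper action can be discretized into the kind of voxel and rotation-bin targets that PerAct "detects". The 100^3 voxel grid and 72 rotation bins per axis (5-degree resolution) follow the settings reported in the paper; the function name, signature, and workspace bounds are illustrative assumptions.

\begin{verbatim}
import numpy as np

def discretize_action(pos, euler_deg, gripper_open,
                      workspace_min, workspace_max,
                      grid_size=100, rot_bins=72):
    """Map a continuous 6-DoF gripper action to discrete targets
    in the spirit of the 'next best voxel action' formulation.

    pos          -- (3,) target position in meters, inside the workspace.
    euler_deg    -- (3,) target rotation as Euler angles in degrees.
    gripper_open -- bool, target gripper state.
    Returns: voxel index (3 ints), rotation bin per axis, gripper bit.
    """
    # Translation: index of the voxel containing the target position.
    extent = workspace_max - workspace_min
    voxel = np.floor((pos - workspace_min) / extent * grid_size)
    voxel = np.clip(voxel, 0, grid_size - 1).astype(int)

    # Rotation: each Euler axis quantized into rot_bins bins
    # (72 bins -> 5-degree resolution, as in the paper).
    rot = np.floor((euler_deg % 360.0) / (360.0 / rot_bins)).astype(int)

    return voxel, rot, int(gripper_open)

# Example with hypothetical workspace bounds of a 1 m cube:
lo, hi = np.array([-0.5, -0.5, 0.0]), np.array([0.5, 0.5, 1.0])
voxel, rot, grip = discretize_action(
    np.array([0.12, -0.03, 0.25]),   # target position
    np.array([0.0, 90.0, -45.0]),    # target orientation
    True, lo, hi)
# voxel -> [62, 47, 25], rot -> [0, 18, 63], grip -> 1
\end{verbatim}

Framing the supervised target this way turns continuous action regression into a classification ("detection") problem over the voxel grid, which is the structural prior the abstract credits for sample-efficient 6-DoF learning.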