Transformers have revolutionized vision and natural language processing with their ability to scale with large datasets. But in robotic manipulation, data is both limited and expensive. Can we still benefit from Transformers with the right problem formulation? We investigate this question with PerAct, a language-conditioned behavior-cloning agent for multi-task 6-DoF manipulation. PerAct encodes language goals and RGB-D voxel observations with a Perceiver Transformer, and outputs discretized actions by "detecting the next best voxel action". Unlike frameworks that operate on 2D images, the voxelized observation and action space provides a strong structural prior for efficiently learning 6-DoF policies. With this formulation, we train a single multi-task Transformer for 18 RLBench tasks (with 249 variations) and 7 real-world tasks (with 18 variations) from just a few demonstrations per task. Our results show that PerAct significantly outperforms unstructured image-to-action agents and 3D ConvNet baselines for a wide range of tabletop tasks.
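The discretized action space described above can be illustrated with a minimal sketch. This is not PerAct's implementation; the grid resolution, rotation bin count, and workspace bounds here are hypothetical parameters chosen for illustration, and the function name is invented:

```python
import numpy as np

def discretize_action(pos, euler_deg, gripper_open,
                      workspace_min, workspace_max,
                      grid_size=100, rot_bins=72):
    """Map a continuous 6-DoF gripper action to discrete targets:
    translation -> a voxel index in a grid_size^3 grid,
    rotation    -> one bin per Euler axis (360/rot_bins degrees each),
    gripper     -> a binary open/close label.
    A policy can then be trained to classify ("detect") these targets."""
    pos = np.asarray(pos, dtype=np.float64)
    lo = np.asarray(workspace_min, dtype=np.float64)
    span = np.asarray(workspace_max, dtype=np.float64) - lo
    # Translation: normalize into [0, 1), scale to grid, clamp to valid indices.
    voxel = np.clip(((pos - lo) / span * grid_size).astype(int),
                    0, grid_size - 1)
    # Rotation: wrap each Euler angle into [0, 360) and bin it.
    rot = ((np.asarray(euler_deg) % 360.0) / (360.0 / rot_bins)).astype(int) % rot_bins
    return voxel, rot, int(gripper_open)

# Example: a pose inside a 1 m^3 workspace.
voxel, rot, grip = discretize_action(
    pos=[0.25, 0.5, 0.75], euler_deg=[0.0, 90.0, -45.0], gripper_open=True,
    workspace_min=[0.0, 0.0, 0.0], workspace_max=[1.0, 1.0, 1.0])
# voxel -> [25 50 75], rot -> [0 18 63], grip -> 1
```

Framing the output this way turns 6-DoF action prediction into a classification problem over the voxel grid and rotation bins, which is the structural prior the abstract refers to.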