In human environments, robots are expected to accomplish a variety of manipulation tasks given simple natural language instructions. Yet, robotic manipulation is extremely challenging, as it requires fine-grained motor control, long-term memory, and generalization to previously unseen tasks and environments. To address these challenges, we propose a unified transformer-based approach that takes multiple inputs into account. In particular, our transformer architecture integrates (i) natural language instructions and (ii) multi-view scene observations, while (iii) keeping track of the full history of observations and actions. Such an approach enables learning dependencies between history and instructions, and improves manipulation precision by exploiting multiple views. We evaluate our method on the challenging RLBench benchmark and on a real-world robot. Notably, our approach scales to 74 diverse RLBench tasks and outperforms the state of the art. We also address instruction-conditioned tasks and demonstrate excellent generalization to previously unseen variations.
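The core idea of fusing instruction, multi-view, and history tokens in a single attention pass can be illustrated with a minimal sketch. This is not the paper's implementation: the dimensions, token counts, number of views, history length, and 7-DoF action head are all illustrative assumptions, and a single random-weight attention step stands in for the full learned transformer.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # token embedding dimension (assumed for illustration)

def attention(q, k, v):
    # Single-head scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v

# Hypothetical token streams: 5 instruction tokens, 3 camera-view
# tokens for the current step, and 4 past (observation, action) pairs.
instr = rng.normal(size=(5, D))
views = rng.normal(size=(3, D))
history = rng.normal(size=(4 * 2, D))

# Concatenating all modalities into one sequence lets attention model
# dependencies between the instruction, the views, and the history.
tokens = np.concatenate([instr, views, history], axis=0)

# A learned action query attends over the joint sequence; the result
# is mapped to a 7-DoF action (pose + gripper, assumed layout).
action_query = rng.normal(size=(1, D))
fused = attention(action_query, tokens, tokens)
W_out = rng.normal(size=(D, 7))
action = fused @ W_out
print(action.shape)  # (1, 7)
```

In the actual model each step's prediction is conditioned on the full sequence in this way, which is what allows history-dependent tasks and multi-view precision to be handled by one architecture.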