Humans are excellent at understanding language and vision to accomplish a wide range of tasks. In contrast, creating general instruction-following embodied agents remains a difficult challenge. Prior work that uses pure language-only models lacks visual grounding, making it difficult to connect language instructions with visual observations. On the other hand, methods that employ pre-trained vision-language models typically come with separate language and visual representations, requiring specialized network architectures to fuse them together. We propose a simple yet effective model for robots to solve instruction-following tasks in vision-based environments. Our \ours method consists of a multimodal transformer that encodes visual observations and language instructions, and a policy transformer that predicts actions based on the encoded representations. The multimodal transformer is pre-trained on millions of image-text pairs and natural language text, thereby producing generic cross-modal representations of observations and instructions. The policy transformer keeps track of the full history of observations and actions, and predicts actions autoregressively. We show that this unified transformer model outperforms all state-of-the-art pre-trained and trained-from-scratch methods in both single-task and multi-task settings. Our model also exhibits better scalability and generalization ability than prior work.
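To make the two-stage design concrete, below is a minimal sketch (not the authors' implementation) of the interface described above: a pre-trained multimodal encoder is assumed to produce joint observation/instruction features, and a causal policy transformer attends over the full history of encoded observations and past actions to predict the next action autoregressively. All module names, dimensions, and the discrete action space are illustrative assumptions.

```python
# Hedged sketch of the abstract's architecture: the real multimodal encoder
# (pre-trained on image-text pairs and text) is replaced here by stand-in
# feature tensors; only the causal policy transformer is spelled out.
import torch
import torch.nn as nn


class PolicyTransformer(nn.Module):
    def __init__(self, d_model=256, n_actions=8, n_layers=4, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.action_embed = nn.Embedding(n_actions, d_model)
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, obs_tokens, past_actions):
        # Combine encoded observations with embeddings of previously taken
        # actions (concatenated along time for brevity), then apply causal
        # self-attention so each step only sees the history up to itself.
        act_tokens = self.action_embed(past_actions)            # (B, T-1, D)
        seq = torch.cat([obs_tokens, act_tokens], dim=1)        # (B, 2T-1, D)
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        h = self.backbone(seq, mask=mask)
        return self.action_head(h[:, -1])                       # logits for next action


if __name__ == "__main__":
    B, T, D = 2, 5, 256
    # Stand-in for cross-modal features from a pre-trained vision-language
    # encoder applied to the instruction and each visual observation.
    obs_tokens = torch.randn(B, T, D)
    past_actions = torch.randint(0, 8, (B, T - 1))
    policy = PolicyTransformer(d_model=D)
    logits = policy(obs_tokens, past_actions)
    print(logits.shape)  # torch.Size([2, 8])
```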