Instruction-Following Agents with Jointly Pre-Trained Vision-Language Models

Humans are excellent at understanding language and vision to accomplish a wide range of tasks. In contrast, creating general instruction-following embodied agents remains a difficult challenge. Prior work that uses language-only models lacks visual grounding, making it difficult to connect language instructions with visual observations. On the other hand, methods that use pre-trained vision-language models typically come with separate language and visual representations, and require specialized network architectures to fuse them. We propose a simple yet effective model for robots to solve instruction-following tasks in vision-based environments. Our method consists of a multimodal transformer that encodes visual observations and language instructions, and a policy transformer that predicts actions based on the encoded representations. The multimodal transformer is pre-trained on millions of image-text pairs and natural language text, thereby producing generic cross-modal representations of observations and instructions. The policy transformer keeps track of the full history of observations and actions, and predicts actions autoregressively. We show that this unified transformer model outperforms all state-of-the-art pre-trained or trained-from-scratch methods in both single-task and multi-task settings. Our model also shows better scalability and generalization ability than prior work.
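The two-stage design described in the abstract (an encoder that fuses each observation with the instruction, followed by a policy that conditions on the full history and emits one action per step) can be illustrated with a minimal control-loop sketch. This is not the authors' implementation: `run_episode`, `toy_encode`, and `toy_policy` are hypothetical names, the two toy functions are trivial stand-ins for the pre-trained multimodal transformer and the policy transformer, and the observations are a fixed list rather than the output of a real environment.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

Vector = List[float]


@dataclass
class History:
    """Everything the policy has seen so far in one episode."""
    encodings: List[Vector] = field(default_factory=list)
    actions: List[int] = field(default_factory=list)


def run_episode(
    encode: Callable[[Sequence[float], str], Vector],
    policy: Callable[[List[Vector], List[int]], int],
    observations: Sequence[Sequence[float]],
    instruction: str,
) -> List[int]:
    """Autoregressive rollout: each action depends on all past encodings and actions."""
    history = History()
    for obs in observations:
        # Stage 1: jointly encode the current observation with the instruction.
        history.encodings.append(encode(obs, instruction))
        # Stage 2: predict the next action from the full history.
        action = policy(history.encodings, history.actions)
        history.actions.append(action)
    return history.actions


def toy_encode(obs: Sequence[float], instruction: str) -> Vector:
    # Assumption: placeholder for a pre-trained multimodal transformer.
    return [float(sum(obs)), float(len(instruction.split()))]


def toy_policy(encodings: List[Vector], actions: List[int]) -> int:
    # Assumption: placeholder for a policy transformer; picks one of 3 actions.
    return int(sum(e[0] for e in encodings) + len(actions)) % 3


if __name__ == "__main__":
    print(run_episode(toy_encode, toy_policy, [[1, 2], [3], [0, 0]], "pick up the cup"))
```

In a real agent, each new observation would be returned by the environment after the previous action is executed. The fixed list here only keeps the sketch self-contained.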

