Prompt-based learning has emerged as a successful paradigm in natural language processing, where a single general-purpose language model can be instructed to perform any task specified by input prompts. Yet task specification in robotics comes in various forms, such as imitating one-shot demonstrations, following language instructions, and reaching visual goals. These are often considered different tasks and tackled by specialized models. This work shows that we can express a wide spectrum of robot manipulation tasks with multimodal prompts, interleaving textual and visual tokens. We design a transformer-based generalist robot agent, VIMA, that processes these prompts and outputs motor actions autoregressively. To train and evaluate VIMA, we develop a new simulation benchmark with thousands of procedurally-generated tabletop tasks with multimodal prompts, 600K+ expert trajectories for imitation learning, and four levels of evaluation protocol for systematic generalization. VIMA achieves strong scalability in both model capacity and data size. It outperforms prior SOTA methods in the hardest zero-shot generalization setting by up to $2.9\times$ task success rate given the same training data. With $10\times$ less training data, VIMA still performs $2.7\times$ better than the top competing approach. We open-source all code, pretrained models, dataset, and simulation benchmark at https://vimalabs.github.io.
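To make the multimodal-prompt interface concrete, the sketch below illustrates the high-level recipe described above: a prompt is flattened into one interleaved sequence of text and image tokens, and motor actions are decoded autoregressively, each prediction conditioning on the prompt plus the actions emitted so far. This is an illustration only, not the VIMA implementation; all names here (`TextToken`, `ImageToken`, `PolicyStub`, `rollout`) are hypothetical placeholders.

```python
# Illustrative sketch only -- NOT the VIMA implementation. All names here
# (TextToken, ImageToken, PolicyStub, rollout) are hypothetical placeholders
# that mimic the paper's high-level recipe: a multimodal prompt is one
# interleaved sequence of text and image tokens, and actions are decoded
# autoregressively.
from dataclasses import dataclass
from typing import List, Union

@dataclass
class TextToken:
    word: str          # a word piece from the instruction

@dataclass
class ImageToken:
    object_crop: str   # placeholder for an image crop of a referenced object

Prompt = List[Union[TextToken, ImageToken]]

# Example multimodal prompt: "Put the <object A> into the <object B>."
prompt: Prompt = [
    TextToken("Put"), TextToken("the"), ImageToken("crop_of_object_A"),
    TextToken("into"), TextToken("the"), ImageToken("crop_of_object_B"),
]

class PolicyStub:
    """Stand-in for a transformer policy that conditions on the prompt and the
    interaction history and emits one primitive action per step."""

    def predict(self, prompt: Prompt, history: List[str]) -> str:
        # A real model would cross-attend to encoded prompt tokens; here we
        # just emit a fixed pick-and-place sequence for illustration.
        plan = ["pick(object_A)", "place(object_B)", "stop"]
        return plan[min(len(history), len(plan) - 1)]

def rollout(policy: PolicyStub, prompt: Prompt, max_steps: int = 10) -> List[str]:
    """Autoregressive decoding loop: each predicted action is appended to the
    history that conditions the next prediction."""
    history: List[str] = []
    for _ in range(max_steps):
        action = policy.predict(prompt, history)
        history.append(action)
        if action == "stop":
            break
    return history

print(rollout(PolicyStub(), prompt))  # -> ['pick(object_A)', 'place(object_B)', 'stop']
```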