Instruction-following agents must ground language into their observation and action spaces. Learning to ground language is challenging, typically requiring domain-specific engineering or large quantities of human interaction data. To address this challenge, we propose using pretrained vision-language models (VLMs) to supervise embodied agents. We combine ideas from model distillation and hindsight experience replay (HER), using a VLM to retroactively generate language describing the agent's behavior. Simple prompting allows us to control the supervision signal, teaching an agent to interact with novel objects based on their names (e.g., planes) or their features (e.g., colors) in a 3D rendered environment. Few-shot prompting lets us teach abstract category membership, including pre-existing categories (food vs. toys) and ad-hoc ones (arbitrary preferences over objects). Our work outlines a new and effective way to use internet-scale VLMs, repurposing the generic language grounding acquired by such models to teach task-relevant groundings to embodied agents.
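The core mechanism, VLM-based hindsight relabeling, can be sketched compactly. The snippet below is a minimal illustration rather than the authors' implementation: `vlm_describe` stands in for any pretrained vision-language captioner, and the `prompt` argument is a hypothetical handle for controlling what the VLM describes (object names, colors, or few-shot category examples).

```python
# Minimal sketch (not the paper's code) of hindsight relabeling with a VLM.
# A trajectory collected without instructions is retroactively labeled with
# the VLM's description of what the agent actually did; the relabeled data
# then supervises an instruction-following policy (distillation + HER).

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Transition:
    observation: object   # e.g. a rendered image from the 3D environment
    action: int
    instruction: str = "" # filled in retroactively by the VLM

def hindsight_relabel(
    trajectory: List[Transition],
    vlm_describe: Callable[[object, str], str],  # hypothetical VLM interface
    prompt: str,
) -> List[Transition]:
    """Ask the VLM what the agent did, then treat that description as the
    instruction the agent is trained to follow on this trajectory."""
    final_obs = trajectory[-1].observation
    description = vlm_describe(final_obs, prompt)  # e.g. "picked up the red plane"
    return [Transition(t.observation, t.action, description) for t in trajectory]

# Usage: the relabeled (instruction, observation, action) tuples form a
# supervised dataset for behavioral cloning of the language-conditioned policy.
```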