Benefiting from the flexibility and compositionality of language, humans naturally use language to command embodied agents to perform complex tasks such as navigation and object manipulation. In this work, we aim to fill the gap in the last mile of embodied agents: object manipulation by following human guidance, e.g., "move the red mug next to the box while keeping it upright." To this end, we introduce an Automatic Manipulation Solver (AMSolver) system and build a Vision-and-Language Manipulation benchmark (VLMbench) on top of it, containing various language instructions for categorized robotic manipulation tasks. Specifically, modular rule-based task templates are created to automatically generate robot demonstrations paired with language instructions, covering diverse object shapes and appearances, action types, and motion constraints. We also develop a keypoint-based model, 6D-CLIPort, that takes multi-view observations and language input and outputs a sequence of 6-degree-of-freedom (DoF) actions. We hope the new simulator and benchmark will facilitate future research on language-guided robotic manipulation.
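To make the described interface concrete, the following is a minimal, purely illustrative Python sketch of the input/output contract one might expect from a keypoint-based 6-DoF policy such as 6D-CLIPort: multi-view RGB-D observations plus a language instruction in, a short sequence of 6-DoF end-effector poses out. The function name, signature, and placeholder logic are assumptions for illustration and are not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): the assumed I/O contract of a
# keypoint-based 6-DoF manipulation policy. All names and shapes are hypothetical.
import numpy as np

def predict_waypoints(rgb_views: list,
                      depth_views: list,
                      instruction: str,
                      num_steps: int = 2) -> np.ndarray:
    """Map multi-view RGB-D observations and a language instruction to a
    sequence of 6-DoF end-effector poses (x, y, z, roll, pitch, yaw).

    A real model would fuse the camera views, ground the instruction, and
    score keypoints; here we only return placeholder poses of the right shape.
    """
    assert len(rgb_views) == len(depth_views) and len(rgb_views) > 0
    return np.zeros((num_steps, 6), dtype=np.float32)

# Example call: two camera views, one instruction, two predicted waypoints.
views_rgb = [np.zeros((128, 128, 3), np.uint8) for _ in range(2)]
views_depth = [np.zeros((128, 128), np.float32) for _ in range(2)]
actions = predict_waypoints(
    views_rgb, views_depth,
    "move the red mug next to the box while keeping it upright")
print(actions.shape)  # (2, 6)
```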