Benefiting from the flexibility and compositionality of language, humans naturally use it to command embodied agents on complex tasks such as navigation and object manipulation. In this work, we aim to fill in the last mile of embodied agents: object manipulation by following human guidance, e.g., "move the red mug next to the box while keeping it upright." To this end, we introduce an Automatic Manipulation Solver (AMSolver) simulator and build a Vision-and-Language Manipulation benchmark (VLMbench) on top of it, containing varied language instructions over categorized robotic manipulation tasks. Specifically, modular rule-based task templates are created to automatically generate robot demonstrations paired with language instructions, covering diverse object shapes and appearances, action types, and motion constraints. We also develop a keypoint-based model, 6D-CLIPort, which takes multi-view observations and language input and outputs a sequence of 6 degrees of freedom (DoF) actions. We hope the new simulator and benchmark will facilitate future research on language-guided robotic manipulation.