To perform household tasks, assistive robots receive commands in the form of user language instructions for tool manipulation. The initial stage involves selecting the intended tool (i.e., object grounding) and grasping it in a task-oriented manner (i.e., task grounding). However, prior research on visual-language grasping (VLG) focuses on object grounding while disregarding the fine-grained impact of the task on object grasping. Task-incompatible grasping of a tool will inevitably limit the success of subsequent manipulation steps. Motivated by this problem, this paper proposes GraspCLIP, which addresses task grounding in addition to object grounding to enable task-oriented grasp prediction with visual-language inputs. Evaluation on a custom dataset demonstrates that GraspCLIP outperforms established baselines that perform object grounding only. The effectiveness of the proposed method is further validated on an assistive robotic arm platform, which grasps previously unseen kitchen tools given a task specification. Our presentation video is available at: https://www.youtube.com/watch?v=e1wfYQPeAXU.