Seemingly simple natural language requests to a robot are generally underspecified, for example "Can you bring me the wireless mouse?" When viewing mice on a shelf, the number of buttons or the presence of a wire may not be visible from certain angles or positions. Flat images of candidate mice may not provide the discriminative information needed for "wireless". The world, and the objects in it, are not flat images but complex 3D shapes. If a human requests an object based on any of its basic properties, such as color, shape, or texture, robots should perform the exploration necessary to accomplish the task. In particular, while substantial effort and progress have been made on understanding explicitly visual attributes like color and category, comparatively little progress has been made on understanding language about shapes and contours. In this work, we introduce a novel reasoning task that targets both visual and non-visual language about 3D objects. Our new benchmark, ShapeNet Annotated with Referring Expressions (SNARE), requires a model to choose which of two objects is being referenced by a natural language description. We introduce several CLIP-based models for distinguishing objects and demonstrate that while recent advances in jointly modeling vision and language are useful for robotic language understanding, these models remain weaker at understanding the 3D nature of objects, properties that play a key role in manipulation. In particular, we find that adding view estimation to language grounding models improves accuracy both on SNARE and when identifying objects referred to in language on a robot platform.