Natural Human-Robot Interaction (HRI) is one of the key components enabling service robots to work in human-centric environments. In such dynamic environments, the robot needs to understand the user's intention in order to accomplish a task successfully. To address this point, we propose a software architecture that segments a target object, indicated verbally by a human user, from a crowded scene. At the core of our system, we employ a multi-modal deep neural network for visual grounding. Unlike most grounding methods, which tackle the challenge in a two-step process using pre-trained object detectors, we develop a single-stage zero-shot model that can make predictions on unseen data. We evaluate the performance of the proposed model on real RGB-D data collected from public scene datasets. Experimental results show that the proposed model performs well in terms of accuracy and speed, while showcasing robustness to variation in the natural language input.
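As a rough illustration of the single-stage idea, the sketch below fuses visual features from an RGB-D input with a sentence embedding and predicts a target mask in one forward pass, with no intermediate object-detection stage. The `SingleStageGrounder` name, all layer sizes, and the element-wise fusion are hypothetical choices for this PyTorch example, not the architecture proposed here.

```python
# Minimal sketch of a single-stage multi-modal grounding network (hypothetical
# layer sizes; not the paper's exact architecture). Visual features from an
# RGB-D input are fused with a sentence embedding to predict a per-pixel
# target mask in one forward pass, with no separate object-detection stage.
import torch
import torch.nn as nn

class SingleStageGrounder(nn.Module):
    def __init__(self, text_dim=512, vis_dim=256):
        super().__init__()
        # Lightweight visual encoder; a real system would use a deeper backbone.
        self.visual = nn.Sequential(
            nn.Conv2d(4, vis_dim, kernel_size=3, padding=1),  # 4 = RGB-D channels
            nn.ReLU(),
            nn.Conv2d(vis_dim, vis_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Project the language embedding into the visual feature space.
        self.text_proj = nn.Linear(text_dim, vis_dim)
        # Single-stage mask head over the fused features.
        self.mask_head = nn.Conv2d(vis_dim, 1, kernel_size=1)

    def forward(self, rgbd, text_emb):
        # rgbd: (B, 4, H, W); text_emb: (B, text_dim), e.g. from a frozen
        # sentence encoder, which is what enables zero-shot language queries.
        v = self.visual(rgbd)                           # (B, vis_dim, H, W)
        t = self.text_proj(text_emb)[:, :, None, None]  # (B, vis_dim, 1, 1)
        fused = v * t                                   # element-wise fusion
        return self.mask_head(fused)                    # (B, 1, H, W) mask logits

model = SingleStageGrounder()
mask_logits = model(torch.randn(1, 4, 120, 160), torch.randn(1, 512))
print(mask_logits.shape)  # torch.Size([1, 1, 120, 160])
```

Collapsing detection and grounding into one network like this is what lets the whole pipeline run in a single forward pass, which is where the speed advantage over two-step detector-based methods comes from.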