Non-verbal communication plays a particularly important role in a wide range of scenarios in Human-Robot Interaction (HRI). Accordingly, this work addresses the problem of human gesture recognition. In particular, we focus on head and eye gestures, and adopt an egocentric (first-person) perspective using eyewear cameras. We argue that this egocentric view may offer a number of conceptual and technical benefits over scene- or robot-centric perspectives. We propose a motion-based recognition approach that operates at two temporal granularities. Locally, frame-to-frame homographies are estimated with a convolutional neural network (CNN). The output of this CNN is then fed to a long short-term memory (LSTM) network to capture the longer-term temporal visual relationships that are relevant for characterizing gestures. Regarding the configuration of the network architecture, one particularly interesting finding is that using the output of an internal layer of the homography CNN yields a higher recognition rate than using the homography matrix itself. Although this work focuses on action recognition, and no robot or user study has been conducted yet, the system has been designed to meet real-time constraints. The encouraging results suggest that the proposed egocentric perspective is viable, and this proof-of-concept work provides novel and useful contributions to the exciting area of HRI.
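The two temporal granularities described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: `pair_features` is a hypothetical stand-in for the internal-layer activations of the homography CNN (here just a frame difference), the LSTM weights are random and untrained, and the feature size, hidden size, and number of gesture classes are all assumed values chosen for illustration. A real system would presumably use a deep-learning framework and trained networks.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]


def pair_features(prev_frame, next_frame):
    """Hypothetical placeholder for the local (frame-to-frame) stage.

    In the paper this role is played by a homography CNN; here we simply
    take the element-wise difference of two toy "frames".
    """
    return [b - a for a, b in zip(prev_frame, next_frame)]


class LSTMCell:
    """A single LSTM cell with random, untrained weights (illustration only)."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        # One weight matrix and bias per gate:
        # input (i), forget (f), candidate (g), output (o).
        self.W = {k: rand_matrix(hidden_size, input_size + hidden_size, rng)
                  for k in "ifgo"}
        self.b = {k: [0.0] * hidden_size for k in "ifgo"}

    def step(self, x, h, c):
        z = x + h  # concatenate current input and previous hidden state
        pre = {k: [a + b for a, b in zip(matvec(self.W[k], z), self.b[k])]
               for k in "ifgo"}
        i = [sigmoid(v) for v in pre["i"]]
        f = [sigmoid(v) for v in pre["f"]]
        g = [math.tanh(v) for v in pre["g"]]
        o = [sigmoid(v) for v in pre["o"]]
        c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
        return h_new, c_new


def recognize_gesture(frames, cell, W_out):
    """Run the local stage on consecutive frame pairs, then the LSTM over time.

    Returns a probability distribution over the (assumed) gesture classes.
    """
    h = [0.0] * cell.hidden_size
    c = [0.0] * cell.hidden_size
    for prev_frame, next_frame in zip(frames, frames[1:]):
        x = pair_features(prev_frame, next_frame)
        h, c = cell.step(x, h, c)
    return softmax(matvec(W_out, h))


if __name__ == "__main__":
    rng = random.Random(0)
    frames = [[rng.random() for _ in range(8)] for _ in range(6)]  # toy "video"
    cell = LSTMCell(input_size=8, hidden_size=4, rng=rng)
    W_out = rand_matrix(3, 4, rng)  # 3 gesture classes is an assumption
    print(recognize_gesture(frames, cell, W_out))
```

The point of the sketch is the data flow: each pair of consecutive frames is reduced to a feature vector, and only the sequence of those vectors reaches the recurrent stage. Swapping `pair_features` between "the homography matrix itself" and "an internal CNN layer" is where the design choice reported in the abstract would enter.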