Non-verbal communication plays a particularly important role in a wide range of scenarios in Human-Robot Interaction (HRI). Accordingly, this work addresses the problem of human gesture recognition. In particular, we focus on head and eye gestures, and adopt an egocentric (first-person) perspective using eyewear cameras. We argue that this egocentric view offers a number of conceptual and technical benefits over scene- or robot-centric perspectives. A motion-based recognition approach is proposed, which operates at two temporal granularities. Locally, frame-to-frame homographies are estimated with a convolutional neural network (CNN). The output of this CNN is fed to a long short-term memory (LSTM) network to capture the longer-term temporal visual relationships that characterize gestures. Regarding the configuration of the network architecture, one particularly interesting finding is that using the output of an internal layer of the homography CNN increases the recognition rate with respect to using the homography matrix itself. While this work focuses on action recognition, and no robot or user study has been conducted yet, the system has been designed to meet real-time constraints. The encouraging results suggest that the proposed egocentric perspective is viable, and this proof-of-concept work provides novel and useful contributions to the exciting area of HRI.
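To make the two-granularity pipeline concrete, the sketch below shows how per-frame motion features (standing in for the internal-layer activations of the homography CNN described above) could be aggregated over time by a single LSTM cell and mapped to gesture-class probabilities. This is a minimal, self-contained NumPy illustration under assumed dimensions, not the paper's actual architecture; all names (`SimpleLSTMClassifier`, `input_dim`, `hidden_dim`) are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SimpleLSTMClassifier:
    """Hypothetical sketch: one LSTM layer over per-frame motion features
    (stand-in for the homography CNN's internal-layer output), followed by
    a linear softmax head over gesture classes."""

    def __init__(self, input_dim, hidden_dim, num_classes, seed=0):
        rng = np.random.default_rng(seed)
        # Fused weight matrix for the four LSTM gates (input, forget, cell, output).
        self.W = rng.normal(0.0, 0.1, (4 * hidden_dim, input_dim + hidden_dim))
        self.b = np.zeros(4 * hidden_dim)
        self.W_out = rng.normal(0.0, 0.1, (num_classes, hidden_dim))
        self.hidden_dim = hidden_dim

    def forward(self, features):
        """features: array of shape (T, input_dim), one row per video frame."""
        H = self.hidden_dim
        h = np.zeros(H)  # hidden state
        c = np.zeros(H)  # cell state
        for x in features:
            z = self.W @ np.concatenate([x, h]) + self.b
            i = sigmoid(z[:H])          # input gate
            f = sigmoid(z[H:2 * H])     # forget gate
            g = np.tanh(z[2 * H:3 * H]) # candidate cell update
            o = sigmoid(z[3 * H:])      # output gate
            c = f * c + i * g
            h = o * np.tanh(c)
        logits = self.W_out @ h
        probs = np.exp(logits - logits.max())
        return probs / probs.sum()  # softmax over gesture classes

# Usage: classify a 30-frame clip of 64-dimensional motion features
# into, say, 5 head/eye gesture classes.
model = SimpleLSTMClassifier(input_dim=64, hidden_dim=32, num_classes=5)
clip = np.random.default_rng(1).normal(size=(30, 64))
probs = model.forward(clip)
```

Feeding the CNN's internal-layer features (rather than the 3x3 homography matrix flattened to 9 values) simply changes `input_dim` here; the richer representation is what the abstract reports as improving recognition rates.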