Intuitive user interfaces are indispensable for interacting with human-centric smart environments. In this paper, we propose a unified framework that recognizes both static and dynamic gestures using plain RGB vision, without depth sensing. This makes it suitable for low-cost human-robot interaction in social or industrial settings. We employ a pose-driven spatial attention strategy, which guides our proposed Static and Dynamic gestures Network (StaDNet). From an image of the human upper body, we estimate the person's depth along with the regions-of-interest around his/her hands. The convolutional neural network in StaDNet is fine-tuned on a background-substituted hand gesture dataset; it detects 10 static gestures for each hand and produces the hand image-embeddings. These embeddings are subsequently fused with the augmented pose vector and passed to stacked Long Short-Term Memory blocks. Thus, human-centred frame-wise information from the augmented pose vector and from the left/right hand image-embeddings is aggregated over time to predict the dynamic gesture of the performing person. In a number of experiments, we show that the proposed approach surpasses the state-of-the-art results on the large-scale Chalearn 2016 dataset. Moreover, we transfer the knowledge learned through the proposed methodology to the Praxis gestures dataset, and the obtained results also outperform the state-of-the-art on this dataset.
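To make the described data flow concrete, the following is a minimal PyTorch sketch of the pipeline: per-frame hand crops are encoded by a CNN that yields both static gesture logits and image-embeddings, the embeddings are fused with the pose vector, and stacked LSTMs aggregate the sequence to predict a dynamic gesture. All layer sizes, the class counts, and the toy CNN backbone are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
import torch
import torch.nn as nn

class StaDNetSketch(nn.Module):
    """Illustrative sketch of the StaDNet data flow (dimensions are assumed)."""

    def __init__(self, pose_dim=26, embed_dim=128, n_static=10,
                 n_dynamic=249, lstm_hidden=256, lstm_layers=2):
        super().__init__()
        # Stand-in for the fine-tuned CNN backbone that encodes hand crops.
        self.hand_cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim), nn.ReLU(),
        )
        # Per-hand static gesture head (10 classes per hand, as in the paper).
        self.static_head = nn.Linear(embed_dim, n_static)
        # Stacked LSTM over the fused frame-wise features.
        self.lstm = nn.LSTM(2 * embed_dim + pose_dim, lstm_hidden,
                            num_layers=lstm_layers, batch_first=True)
        self.dynamic_head = nn.Linear(lstm_hidden, n_dynamic)

    def forward(self, left, right, pose):
        # left/right: (B, T, 3, H, W) hand crops; pose: (B, T, pose_dim)
        B, T = pose.shape[:2]

        def encode(x):
            # Run the CNN on every frame, then restore the time dimension.
            return self.hand_cnn(x.flatten(0, 1)).view(B, T, -1)

        e_l, e_r = encode(left), encode(right)
        static_l = self.static_head(e_l)   # per-frame static logits, left hand
        static_r = self.static_head(e_r)   # per-frame static logits, right hand
        # Fuse hand embeddings with the augmented pose vector per frame.
        fused = torch.cat([e_l, e_r, pose], dim=-1)
        out, _ = self.lstm(fused)
        dynamic = self.dynamic_head(out[:, -1])  # sequence-level dynamic gesture
        return static_l, static_r, dynamic
```

A typical call would pass a batch of cropped left/right hand sequences together with the frame-wise pose vectors, e.g. `model(left, right, pose)` with `left` and `right` of shape `(B, T, 3, H, W)`; the `n_dynamic=249` default reflects the Chalearn 2016 class count, but it is a configurable assumption here.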