In this paper, we realize automatic visual recognition and direction estimation of pointing. We introduce the first neural method for pointing understanding, built on two key contributions. The first is a first-of-its-kind large-scale dataset for pointing recognition and direction estimation, which we refer to as the DP Dataset. The DP Dataset consists of more than 2 million frames of 33 people pointing in various styles, annotated for each frame with pointing timings and 3D directions. The second is DeePoint, a novel deep network model for joint recognition and 3D direction estimation of pointing. DeePoint is a Transformer-based network that fully leverages the spatio-temporal coordination of the body parts, not just the hands. Through extensive experiments, we demonstrate the accuracy and efficiency of DeePoint. We believe the DP Dataset and DeePoint will serve as a sound foundation for visual human intention understanding.