Human-Object Interaction (HOI) recognition in videos is important for analyzing human activity. Most existing work focuses on visual features, which usually suffer from occlusion in real-world scenarios. The problem is further complicated when multiple people and objects are involved in HOIs. Considering that geometric features such as human pose and object position provide meaningful information for understanding HOIs, we argue for combining the benefits of visual and geometric features in HOI recognition, and propose a novel Two-level Geometric feature-informed Graph Convolutional Network (2G-GCN). The geometric-level graph models the interdependency between the geometric features of humans and objects, while the fusion-level graph further fuses them with the visual features of humans and objects. To demonstrate the novelty and effectiveness of our method in challenging scenarios, we propose a new multi-person HOI dataset (MPHOI-72). Extensive experiments on the MPHOI-72 (multi-person HOI), CAD-120 (single-human HOI) and Bimanual Actions (two-hand HOI) datasets show that our method outperforms state-of-the-art methods.
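The two-level idea described above can be illustrated with a minimal sketch. This is not the authors' 2G-GCN implementation: the graph topology (fully connected with a stronger self-loop), the single-layer design, the feature sizes, the random weights, and all function names (`normalized_adjacency`, `graph_conv`, `two_level_forward`) are assumptions made purely for illustration. It shows only the general pattern of first propagating geometric features between entities, then concatenating the result with visual features and propagating again.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(n):
    """Assumed topology: fully connected graph with a doubled self-loop, row-normalized."""
    return [[(2.0 if i == j else 1.0) / (n + 1) for j in range(n)] for i in range(n)]


def graph_conv(adj, feats, weight):
    """One graph convolution step: ReLU(adj @ feats @ weight)."""
    out = matmul(matmul(adj, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]


def random_weight(rng, rows, cols):
    """Random weights stand in for learned parameters (illustration only)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def two_level_forward(geometric, visual, w_geo, w_fuse):
    """Hypothetical two-level pass over one frame's humans and objects."""
    adj = normalized_adjacency(len(geometric))
    # Level 1 (geometric-level graph): mix geometric features across entities.
    geo_emb = graph_conv(adj, geometric, w_geo)
    # Level 2 (fusion-level graph): concatenate each entity's geometric
    # embedding with its visual feature, then mix across entities again.
    fused_in = [g + v for g, v in zip(geo_emb, visual)]
    return graph_conv(adj, fused_in, w_fuse)


rng = random.Random(0)
num_entities, geo_dim, vis_dim, hid_dim, out_dim = 3, 4, 6, 5, 2
# One row per entity (e.g. two humans and one object); values are synthetic.
geometric = [[rng.random() for _ in range(geo_dim)] for _ in range(num_entities)]
visual = [[rng.random() for _ in range(vis_dim)] for _ in range(num_entities)]
w_geo = random_weight(rng, geo_dim, hid_dim)
w_fuse = random_weight(rng, hid_dim + vis_dim, out_dim)

fused = two_level_forward(geometric, visual, w_geo, w_fuse)
print(len(fused), len(fused[0]))  # one fused feature vector per entity
```

In this sketch each entity ends up with one fused feature vector; a real model would learn the weights, operate over time as well as over entities, and feed the fused features to a recognition head.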