The key to Human-Object Interaction (HOI) recognition is inferring the relationship between humans and objects. Recently, image-based HOI detection has made significant progress, but video HOI detection still leaves room for improvement. Existing one-stage methods use carefully designed end-to-end networks to process a video segment and directly predict an interaction, which makes model learning and further optimization of the network more complex. This paper introduces the Spatial Parsing and Dynamic Temporal Pooling (SPDTP) network, which takes the entire video as input in the form of a spatio-temporal graph with human and object nodes. Unlike existing methods, our proposed network first distinguishes interactive from non-interactive pairs through explicit spatial parsing, and then performs interaction recognition. Moreover, we propose a learnable and differentiable Dynamic Temporal Module (DTM) to emphasize the keyframes of a video and suppress redundant frames. The experimental results show that SPDTP pays more attention to active human-object pairs and valid keyframes. Overall, we achieve state-of-the-art performance on the CAD-120 and Something-Else datasets.
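The abstract does not specify how the DTM is realized; as a rough illustrative sketch only (the scoring scheme, function names, and shapes below are assumptions, not the paper's actual module), a learnable and differentiable temporal pooling can be built by scoring each frame, normalizing the scores with a softmax, and taking a weighted sum so that high-scoring keyframes dominate while redundant frames are suppressed:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_temporal_pooling(frame_feats, w):
    """Hypothetical sketch of differentiable temporal pooling.

    frame_feats: (T, D) array of per-frame features.
    w: (D,) learnable scoring vector (a stand-in assumption; the
       paper's DTM parameters are not described in the abstract).
    Returns a (D,) clip-level feature and the (T,) frame weights.
    """
    scores = frame_feats @ w      # one relevance score per frame
    alpha = softmax(scores)       # differentiable weights, sum to 1
    pooled = alpha @ frame_feats  # weighted sum over time
    return pooled, alpha

# Toy example: 5 frames with 4-dimensional features.
rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 4))
w = rng.normal(size=4)
pooled, alpha = dynamic_temporal_pooling(feats, w)
```

Because every step is differentiable, gradients flow back into `w`, so the frame-weighting itself can be learned end-to-end alongside the rest of the network.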