Human-Object Interaction (HOI) detection, which localizes humans and objects and infers the relationships between them, plays an important role in scene understanding. Although two-stage HOI detectors have the advantage of high efficiency in training and inference, they suffer from lower performance than one-stage methods due to outdated backbone networks and the lack of consideration of the human HOI perception process in their interaction classifiers. In this paper, we propose the Vision Transformer based Pose-Conditioned Self-Loop Graph (ViPLO) to resolve these problems. First, we propose a novel feature extraction method suited to the Vision Transformer backbone, called the masking with overlapped area (MOA) module. The MOA module exploits the overlapped area between each patch and the given region in the attention function, which addresses the quantization problem that arises when using a Vision Transformer backbone. In addition, we design a graph with a pose-conditioned self-loop structure that updates the human node encoding with local features of human joints. This allows the classifier to focus on specific human joints to effectively identify the type of interaction, motivated by the human perception process for HOI. As a result, ViPLO achieves state-of-the-art results on two public benchmarks, notably obtaining a +2.07 mAP performance gain on the HICO-DET dataset. The source code is available at https://github.com/Jeeseung-Park/ViPLO.
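To make the MOA idea concrete, the following is a minimal sketch (not the authors' implementation) of how a patch-level attention bias could be derived from the overlap between each ViT patch and a given region: each patch receives the log of its covered fraction, so the bias can be added to pre-softmax attention scores without hard patch quantization. The function name, the grid layout, and the exact normalization are illustrative assumptions.

```python
import numpy as np

def moa_attention_bias(box, image_size=224, patch_size=16, eps=1e-6):
    """Sketch of a masking-with-overlapped-area (MOA) style bias.

    For each ViT patch, compute the fraction of the patch covered by the
    region `box` = (x1, y1, x2, y2) in pixels, then take its log so the
    result can be added to attention scores before the softmax.
    Patches with zero overlap are fully masked (-inf).
    """
    n = image_size // patch_size          # patches per side
    x1, y1, x2, y2 = box
    bias = np.full((n, n), -np.inf)
    for i in range(n):                    # patch row
        for j in range(n):                # patch column
            px1, py1 = j * patch_size, i * patch_size
            px2, py2 = px1 + patch_size, py1 + patch_size
            # intersection of this patch with the region
            iw = max(0.0, min(x2, px2) - max(x1, px1))
            ih = max(0.0, min(y2, py2) - max(y1, py1))
            overlap = (iw * ih) / (patch_size * patch_size)
            if overlap > 0:
                bias[i, j] = np.log(overlap + eps)
    return bias  # shape (n, n); flatten and add to attention scores
```

A patch fully inside the region gets a bias near log(1) = 0, a partially covered boundary patch gets a soft negative bias proportional to its covered fraction, and an uncovered patch is masked out entirely, which is how fractional overlap sidesteps the hard rounding of patch boundaries.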