In recent years, human-object interaction (HOI) detection has achieved impressive advances. However, conventional two-stage methods are usually slow in inference. On the other hand, existing one-stage methods mainly focus on the union regions of interactions, which introduce unnecessary visual information as disturbances to HOI detection. To tackle the problems above, we propose a novel one-stage HOI detection approach, DIRV, based on a new concept for the HOI problem called the interaction region. Unlike previous methods, our approach concentrates on the densely sampled interaction regions across different scales for each human-object pair, so as to capture the subtle visual features that are most essential to the interaction. Moreover, in order to compensate for the detection flaws of a single interaction region, we introduce a novel voting strategy that makes full use of those overlapped interaction regions in place of conventional Non-Maximal Suppression (NMS). Extensive experiments on two popular benchmarks, V-COCO and HICO-DET, show that our approach outperforms existing state-of-the-art methods by a large margin with the highest inference speed and the lightest network architecture. We achieve 56.1 mAP on V-COCO without additional input. Our code is publicly available at: https://github.com/MVIG-SJTU/DIRV
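To illustrate the voting idea described above, the following is a minimal, hypothetical sketch of fusing overlapping interaction-region predictions by a score-weighted vote instead of keeping only the top-scoring one as NMS would. The function name `vote_boxes`, the `(box, score)` input format, and the assumption of positive scores are illustrative only and are not the authors' actual implementation.

```python
def vote_boxes(regions, iou_thresh=0.5):
    """Fuse overlapping interaction-region predictions by a score-weighted
    average of their boxes (voting), rather than suppressing all but the
    highest-scoring one as conventional NMS does.

    `regions`: list of (box, score) tuples, box = (x1, y1, x2, y2),
    with scores assumed positive.
    """
    def iou(a, b):
        # Intersection-over-union of two axis-aligned boxes.
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area = lambda r: max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
        union = area(a) + area(b) - inter
        return inter / union if union > 0 else 0.0

    # Greedily group regions that overlap the current highest-scoring seed,
    # then replace the group by a single score-weighted voted box.
    remaining = sorted(regions, key=lambda r: r[1], reverse=True)
    fused = []
    while remaining:
        seed_box, _ = remaining[0]
        group = [r for r in remaining if iou(r[0], seed_box) >= iou_thresh]
        remaining = [r for r in remaining if iou(r[0], seed_box) < iou_thresh]
        total = sum(s for _, s in group)
        voted = tuple(sum(b[i] * s for b, s in group) / total for i in range(4))
        fused.append((voted, max(s for _, s in group)))
    return fused
```

In this sketch, every overlapped region contributes to the final localization in proportion to its confidence, which reflects the paper's motivation that a single interaction region may be flawed while the ensemble of overlapping regions is more reliable.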