Recent high-performing Human-Object Interaction (HOI) detection techniques have been strongly influenced by the Transformer-based object detector DETR. Nevertheless, most of them directly map parametric interaction queries into a set of HOI predictions through a vanilla Transformer in a one-stage manner. This leaves the rich inter- and intra-interaction structure under-exploited. In this work, we design a novel Transformer-style HOI detector, Structure-aware Transformer over Interaction Proposals (STIP). This design decomposes HOI set prediction into two successive phases: interaction proposal generation is performed first, and the resulting non-parametric interaction proposals are then transformed into HOI predictions by a structure-aware Transformer. The structure-aware Transformer upgrades the vanilla Transformer by additionally encoding the holistic semantic structure among interaction proposals as well as the local spatial structure of the human and object within each interaction proposal, so as to strengthen HOI predictions. Extensive experiments on the V-COCO and HICO-DET benchmarks demonstrate the effectiveness of STIP, which achieves superior results compared with state-of-the-art HOI detectors. Source code is available at \url{https://github.com/zyong812/STIP}.
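To make the first phase concrete, the following is a minimal Python sketch of interaction proposal generation: pair each detected human with every other detection, score each pair, and keep the top-scoring pairs as proposals. It is an illustration under stated assumptions, not the STIP implementation. The names `Detection`, `Proposal`, `toy_interactiveness`, and `generate_proposals` are hypothetical, and the scoring rule (detection confidences discounted by box-center distance) is a hand-written placeholder for whatever learned interactiveness predictor a real system would use. The second phase, the structure-aware Transformer, is not sketched here.

```python
from typing import Callable, List, NamedTuple, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


class Detection(NamedTuple):
    label: str
    box: Box
    score: float  # detector confidence in [0, 1]


class Proposal(NamedTuple):
    human: Detection
    obj: Detection
    interactiveness: float


def center(box: Box) -> Tuple[float, float]:
    """Return the (x, y) center of a box."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2.0, (y1 + y2) / 2.0


def toy_interactiveness(human: Detection, obj: Detection) -> float:
    """Placeholder score (an assumption, not a learned model):
    product of confidences, discounted by center distance."""
    hx, hy = center(human.box)
    ox, oy = center(obj.box)
    dist = ((hx - ox) ** 2 + (hy - oy) ** 2) ** 0.5
    return human.score * obj.score / (1.0 + dist)


def generate_proposals(
    detections: List[Detection],
    score_fn: Callable[[Detection, Detection], float],
    top_k: int,
) -> List[Proposal]:
    """Pair every human with every other detection and keep the top_k pairs."""
    humans = [d for d in detections if d.label == "person"]
    proposals = [
        Proposal(h, o, score_fn(h, o))
        for h in humans
        for o in detections
        if o is not h  # a detection is never paired with itself
    ]
    proposals.sort(key=lambda p: p.interactiveness, reverse=True)
    return proposals[:top_k]
```

Because the proposals are built from actual detections rather than learned query embeddings, they are non-parametric in the sense used above; each retained pair would then be passed to the second phase for interaction classification.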