Recent developments in transformer models for visual data have led to significant improvements in recognition and detection tasks. In particular, using learnable queries in place of region proposals has given rise to a new class of one-stage detection models, spearheaded by the Detection Transformer (DETR). Variations on this one-stage approach have since dominated human-object interaction (HOI) detection. However, the success of such one-stage HOI detectors can largely be attributed to the representation power of transformers. We discovered that when equipped with the same transformer, their two-stage counterparts can be more performant and memory-efficient, while taking a fraction of the time to train. In this work, we propose the Unary-Pairwise Transformer, a two-stage detector that exploits unary and pairwise representations for HOIs. We observe that the unary and pairwise parts of our transformer network specialise, with the former preferentially increasing the scores of positive examples and the latter decreasing the scores of negative examples. We evaluate our method on the HICO-DET and V-COCO datasets, and significantly outperform state-of-the-art approaches. At inference time, our model with ResNet50 approaches real-time performance on a single GPU.