The goal of this paper is Human-Object Interaction (HO-I) detection. HO-I detection aims to localize interacting human-object regions in an image and classify their interactions. In recent years, researchers have obtained significant improvements by relying on strong HO-I alignment supervision from [5]. HO-I alignment supervision pairs humans with the objects they interact with, and then aligns each human-object pair with its interaction category. Since collecting such annotations is expensive, in this paper we propose to detect HO-I without alignment supervision. We instead rely on image-level supervision that only enumerates the interactions present in an image, without indicating where they occur. Our paper makes three contributions: i) We propose Align-Former, a visual-transformer based CNN that can detect HO-I with only image-level supervision. ii) Align-Former is equipped with an HO-I align layer that learns to select appropriate targets to enable detector supervision. iii) We evaluate Align-Former on HICO-DET [5] and V-COCO [13], and show that Align-Former outperforms existing image-level supervised HO-I detectors by a large margin (a 4.71% mAP improvement, from 16.14% to 20.85%, on HICO-DET [5]).
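To illustrate the idea of learning targets under image-level supervision, here is a minimal sketch of one plausible target-selection rule: each interaction annotated at the image level is assigned to the detector query that currently scores it highest, and that query becomes the pseudo-aligned training target. The function name, the score representation, and the argmax rule are all illustrative assumptions, not the paper's actual HO-I align layer.

```python
# Hypothetical sketch of image-level target selection (NOT the paper's
# actual align layer): each annotated interaction is assigned to the
# highest-scoring detector query, which then serves as the pseudo target.

def select_pseudo_targets(query_scores, image_labels):
    """query_scores: one dict per detector query, {interaction: score}.
    image_labels: interactions annotated for the image (no locations).
    Returns {interaction: index of the query chosen as pseudo target}."""
    targets = {}
    for interaction in image_labels:
        # Pick the query most confident about this interaction class.
        best_q = max(range(len(query_scores)),
                     key=lambda q: query_scores[q].get(interaction, 0.0))
        targets[interaction] = best_q
    return targets
```

Such a rule lets a detector trained without box-level alignment still receive per-query supervision, at the risk of reinforcing early mistakes; a learned align layer, as in the paper, aims to make this selection more robust.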