Human-object interaction (HOI) detection plays a crucial role in human-centric scene understanding and serves as a fundamental building block for many vision tasks. One generalizable and scalable strategy for HOI detection is to use weak supervision, learning from image-level annotations only. This is inherently challenging due to ambiguous human-object associations, the large search space of detecting HOIs, and the highly noisy training signal. A promising strategy to address these challenges is to exploit knowledge from large-scale pretrained models (e.g., CLIP), but a direct knowledge distillation strategy~\citep{liao2022gen} does not perform well in the weakly-supervised setting. In contrast, we develop a CLIP-guided HOI representation capable of incorporating prior knowledge at both the image level and the HOI instance level, and adopt a self-taught mechanism to prune incorrect human-object associations. Experimental results on HICO-DET and V-COCO show that our method outperforms previous works by a sizable margin, demonstrating the efficacy of our HOI representation.
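To make the idea concrete, the sketch below illustrates one plausible way CLIP priors can enter at both levels described above: scoring the whole image against HOI text prompts (image-level prior) and scoring the union region of a candidate human-object pair (instance-level prior), then pruning pairs with weak instance-level support. This is a minimal illustration under assumed details, not the paper's actual architecture; the prompts, boxes, and threshold are hypothetical, and only the standard OpenAI `clip` API is used.

```python
# Minimal sketch (assumptions labeled): CLIP similarity as an HOI prior at
# the image level and at the human-object instance level.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Text embeddings for the HOI label space (illustrative prompts).
hoi_prompts = ["a person riding a bicycle", "a person holding a cup"]
text = clip.tokenize(hoi_prompts).to(device)

image = Image.open("example.jpg")  # hypothetical input image

def clip_scores(pil_image):
    """Cosine-similarity scores between an image (or crop) and all HOI prompts."""
    x = preprocess(pil_image).unsqueeze(0).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(x)
        txt_feat = model.encode_text(text)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).squeeze(0)  # one score per HOI category

# Image-level prior: which interactions are plausible anywhere in the image.
image_level = clip_scores(image)

# Instance-level prior: score the union region of one human-object pair
# (boxes are hypothetical; in practice they come from an object detector).
human_box, object_box = (30, 40, 200, 380), (150, 200, 420, 390)
union = (min(human_box[0], object_box[0]), min(human_box[1], object_box[1]),
         max(human_box[2], object_box[2]), max(human_box[3], object_box[3]))
instance_level = clip_scores(image.crop(union))

# Pairs whose best instance-level score is low can be pruned as likely
# incorrect human-object associations (a simplified self-taught criterion;
# the 0.2 threshold is purely illustrative).
keep_pair = instance_level.max().item() > 0.2
```

In this simplified view, the image-level scores act as a global prior over the HOI label space, while the per-pair scores provide a filtering signal for candidate associations; how these signals are actually fused into the learned representation is the subject of the method itself.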