Human-object interaction (HOI) detection aims to extract interacting human-object pairs and their interaction categories from a given natural image. Even though the labeling effort required to build HOI detection datasets is inherently more extensive than for many other computer vision tasks, weakly-supervised directions in this area have not been sufficiently explored, owing to the difficulty of learning human-object interactions under weak supervision, which is rooted in the combinatorial nature of interactions over the object and predicate space. In this paper, we tackle HOI detection under the weakest supervision setting in the literature, using only image-level interaction labels, with the help of a pretrained vision-language model (VLM) and a large language model (LLM). First, we propose an approach that exploits the grounding capability of the VLM to prune non-interacting human and object proposals, increasing the quality of positive pairs within each bag. Second, we query the LLM about which interactions are possible between a human and a given object category, so that the model does not emphasize unlikely interactions. Lastly, we add an auxiliary weakly-supervised preposition prediction task that makes our model reason explicitly about spatial relations. Extensive experiments and ablations show that each of our contributions improves HOI detection performance.
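To make the proposal-pruning idea concrete, below is a minimal sketch of how a pretrained VLM could filter out non-interacting object proposals. This is an illustration only, not the paper's method: it stands in CLIP crop-scoring for the grounding step, and the model name, prompt templates, and threshold are all assumptions.

from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; any CLIP-style VLM would serve the same role.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def prune_proposals(image: Image.Image, boxes, obj_name: str, keep_thresh: float = 0.5):
    """Keep only boxes whose crop the VLM judges as showing a person
    interacting with the given object, rather than the object alone.
    boxes: list of (x0, y0, x1, y1) proposal coordinates."""
    # Hypothetical prompt pair contrasting "interacting" vs. "isolated".
    prompts = [
        f"a person interacting with a {obj_name}",
        f"a {obj_name} with no person near it",
    ]
    kept = []
    for box in boxes:
        crop = image.crop(box)
        inputs = processor(text=prompts, images=crop, return_tensors="pt", padding=True)
        with torch.no_grad():
            probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
        if probs[0] >= keep_thresh:  # "interacting" prompt scored higher
            kept.append(box)
    return kept

Pruned proposals would then form higher-quality bags for the weakly-supervised pairing stage.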
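The LLM prior can likewise be pictured as a simple masking step. The sketch below assumes a binary plausibility matrix has already been collected by asking an LLM questions such as "Can a person <verb> a <object>?"; the matrix sizes, the random placeholder answers, and the masking function are hypothetical, shown only to convey how such a prior would suppress unlikely verb-object combinations.

import torch

# Assumed sizes, e.g. 117 verbs and 80 object categories as in HICO-DET.
V, O = 117, 80

# plausible[o, v] = 1 if the LLM answered "yes" for object o and verb v.
# Random placeholder here; in practice these come from the LLM's answers.
plausible = (torch.rand(O, V) > 0.5).float()

def mask_interaction_logits(logits: torch.Tensor, obj_labels: torch.Tensor) -> torch.Tensor:
    """Suppress verb logits the LLM prior deems implausible for each
    pair's detected object category.
    logits:     (N, V) per-pair verb scores from the HOI model
    obj_labels: (N,)   object category index of each human-object pair"""
    prior = plausible[obj_labels].bool()          # (N, V) plausibility mask
    neg_inf = torch.finfo(logits.dtype).min
    return torch.where(prior, logits, torch.full_like(logits, neg_inf))

# Usage: mask the verb scores of 4 candidate human-object pairs.
logits = torch.randn(4, V)
obj_labels = torch.tensor([0, 3, 3, 41])
masked = mask_interaction_logits(logits, obj_labels)

Masked logits keep the model from spending probability mass on verb-object pairs the language prior rules out.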