Human-Object Interaction (HOI) detection is the task of identifying "a set of interactions" in an image, which involves i) localizing the subject (i.e., humans) and target (i.e., objects) of each interaction, and ii) classifying the interaction labels. Most existing methods have addressed this task indirectly by detecting human and object instances and individually inferring every pair of detected instances. In this paper, we present a novel framework, referred to as HOTR, which directly predicts a set of <human, object, interaction> triplets from an image based on a transformer encoder-decoder architecture. Through set prediction, our method effectively exploits the inherent semantic relationships in an image and does not require the time-consuming post-processing that is the main bottleneck of existing methods. Our proposed algorithm achieves state-of-the-art performance on two HOI detection benchmarks with an inference time under 1 ms after object detection.
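As a rough illustration of this set-prediction formulation, the sketch below uses a standard PyTorch transformer encoder-decoder whose decoded queries are mapped to pointer scores over detected human/object instances and to interaction-class logits. This is a minimal sketch, not the authors' HOTR implementation: the `HOISetPredictor` module, layer sizes, number of queries, and class counts are assumptions made only to illustrate the idea of predicting a set of <human, object, interaction> triplets in parallel.

```python
# Minimal sketch (assumed, not the authors' code): a transformer encoder-decoder
# that turns image features into a fixed-size set of <human, object, interaction>
# predictions. Each query points to a detected human, a detected object, and an
# interaction class, so no pairwise post-processing over detections is needed.
import torch
import torch.nn as nn


class HOISetPredictor(nn.Module):
    def __init__(self, d_model=256, num_queries=16, num_actions=117):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d_model)  # one query per candidate interaction
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=3, num_decoder_layers=3,
            batch_first=True,
        )
        # Each decoded query is mapped to: a "pointer" over detected human instances,
        # a pointer over detected object instances, and interaction-class logits.
        self.human_ptr = nn.Linear(d_model, d_model)
        self.object_ptr = nn.Linear(d_model, d_model)
        self.action_cls = nn.Linear(d_model, num_actions)

    def forward(self, image_feats, instance_feats):
        # image_feats:    (B, HW, d_model) flattened backbone features
        # instance_feats: (B, N, d_model) embeddings of N detected instances
        B = image_feats.size(0)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        dec = self.transformer(src=image_feats, tgt=q)  # (B, num_queries, d_model)
        # Similarity between each query's pointer and every instance embedding
        # scores which detected human / object the triplet refers to.
        h_scores = self.human_ptr(dec) @ instance_feats.transpose(1, 2)   # (B, Q, N)
        o_scores = self.object_ptr(dec) @ instance_feats.transpose(1, 2)  # (B, Q, N)
        actions = self.action_cls(dec)                                    # (B, Q, num_actions)
        return h_scores, o_scores, actions


# Usage: 100 flattened spatial features, 10 detected instances, batch of 2.
feats = torch.randn(2, 100, 256)
insts = torch.randn(2, 10, 256)
h, o, a = HOISetPredictor()(feats, insts)
print(h.shape, o.shape, a.shape)  # (2, 16, 10), (2, 16, 10), (2, 16, 117)
```

In this reading, the decoder queries play the role of interaction slots: each one is trained (e.g., via set-based matching) to commit to at most one ground-truth triplet, which is what removes the need to enumerate and score every human-object pair after detection.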