Detecting human interactions is crucial for human behavior analysis. Many methods have been proposed for Human-to-Object Interaction (HOI) detection, i.e., detecting which person and object in an image interact and classifying the type of interaction. However, Human-to-Human Interactions, such as social and violent interactions, are generally not considered in available HOI training datasets. Because we believe these types of interactions cannot be ignored or treated separately from HOI when analyzing human behavior, we propose a new interaction dataset that covers both types of human interactions: Human-to-Human-or-Object (H2O). In addition, we introduce a novel taxonomy of verbs, intended to be closer to a description of human body attitude in relation to the surrounding targets of interaction, and more independent of the environment. Unlike some existing datasets, we strive to avoid defining synonymous verbs when their use highly depends on the target type or requires a high level of semantic interpretation. Because the H2O dataset includes V-COCO images annotated with this new taxonomy, each image naturally contains more annotated interactions. This can be an issue for HOI detection methods whose complexity depends on the number of people, targets, or interactions. We therefore propose DIABOLO (Detecting InterActions By Only Looking Once), an efficient subject-centric single-shot method that detects all interactions in one forward pass, with a constant inference time independent of image content. This multi-task network also simultaneously detects all people and objects. We show that sharing a network across these tasks not only saves computational resources but also collaboratively improves performance. Finally, DIABOLO is a strong baseline for the newly proposed H2O interaction detection challenge, as it outperforms all state-of-the-art methods when trained and evaluated on the HOI dataset V-COCO.