For humans, understanding the relationships between objects from visual signals is intuitive. For artificial intelligence, however, this task remains challenging. Researchers have made significant progress on semantic relationship detection, such as human-object interaction detection and visual relationship detection. We take the study of visual relationships a step further, from semantic to geometric. In particular, we predict relative occlusion and relative distance relationships. Detecting these relationships from a single image is difficult, and focusing attention on task-specific regions plays a critical role in detecting them successfully. In this work, (1) we propose a novel three-decoder architecture as the infrastructure for focused attention; (2) we use the generalized intersection box prediction task to effectively guide our model to focus on occlusion-specific regions; and (3) our model achieves new state-of-the-art performance on distance-aware relationship detection. Specifically, our model increases the distance F1-score from 33.8% to 38.6% and boosts the occlusion F1-score from 34.4% to 41.2%. Our code is publicly available.