Identifying relations between objects is central to understanding a scene. While several works have addressed relation modeling in the image domain, progress in the video domain has been constrained by the challenging dynamics of spatio-temporal interactions (e.g., between which objects is there an interaction? when does a relation start and end?). To date, two representative approaches have been proposed to tackle Video Visual Relation Detection (VidVRD): segment-based and window-based. We first point out the limitations of these methods and then propose a novel approach named Temporal Span Proposal Network (TSPN). TSPN tells what to look at: it sparsifies the relation search space by scoring the relationness of each object pair, i.e., how likely a relation exists. TSPN tells when to look: it simultaneously predicts the start-end timestamps (i.e., temporal spans) and the categories of all relation candidates by exploiting the full video context. These two designs yield a win-win scenario: TSPN accelerates training by 2x or more over existing methods while achieving competitive performance on two VidVRD benchmarks (ImageNet-VidVRD and VidOR). Moreover, comprehensive ablative experiments demonstrate the effectiveness of our approach. Code is available at https://github.com/sangminwoo/Temporal-Span-Proposal-Network-VidVRD.
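To make the two design ideas concrete, here is a minimal, hypothetical sketch (not the authors' implementation; module names, feature dimensions, the keep ratio, and the number of predicate classes are illustrative assumptions) of how relationness scoring can sparsify object pairs and how a span head plus a predicate head can be predicted jointly from video-level pair features.

```python
# Hypothetical sketch of TSPN's two ideas, not the released code:
# (1) "what to look at": score the relationness of every subject-object pair
#     and keep only the top-scoring pairs, sparsifying the relation search space;
# (2) "when to look": for each kept pair, predict a temporal span (start/end,
#     normalized to [0, 1]) and predicate logits from full-video pair features.
import torch
import torch.nn as nn


class TSPNSketch(nn.Module):
    def __init__(self, feat_dim=512, num_predicates=132, keep_ratio=0.2):
        super().__init__()
        self.keep_ratio = keep_ratio
        # Scores how likely a relation exists for a subject-object pair.
        self.relationness = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(), nn.Linear(feat_dim, 1)
        )
        # Predicts (start, end) in [0, 1] relative to the video length
        # (ordering of start/end is left unconstrained in this sketch).
        self.span_head = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, 2), nn.Sigmoid()
        )
        # Predicts predicate logits for the pair.
        self.predicate_head = nn.Linear(2 * feat_dim, num_predicates)

    def forward(self, pair_feats):
        # pair_feats: (num_pairs, 2 * feat_dim), concatenated subject/object features.
        scores = self.relationness(pair_feats).squeeze(-1)   # (num_pairs,)
        k = max(1, int(self.keep_ratio * pair_feats.size(0)))
        top_scores, top_idx = scores.topk(k)                  # sparsify: keep top-k pairs
        kept = pair_feats[top_idx]
        spans = self.span_head(kept)                          # (k, 2): start/end in [0, 1]
        predicates = self.predicate_head(kept)                # (k, num_predicates)
        return top_idx, top_scores, spans, predicates


# Usage: 20 candidate pairs, each with concatenated 512-d subject/object features.
model = TSPNSketch()
pairs = torch.randn(20, 2 * 512)
idx, rel_scores, spans, pred_logits = model(pairs)
print(idx.shape, spans.shape, pred_logits.shape)  # [4], [4, 2], [4, 132]
```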