Identifying relations between objects is central to understanding a scene. While several works have addressed relation modeling in the image domain, progress in the video domain has been constrained by the challenging dynamics of spatio-temporal interactions (e.g., between which objects is there an interaction? When do relations begin and end?). To date, two representative approaches have been proposed to tackle Video Visual Relation Detection (VidVRD): segment-based and window-based. We first point out the limitations of these two approaches and then propose the Temporal Span Proposal Network (TSPN), a novel method with two advantages in terms of efficiency and effectiveness. 1) TSPN tells what to look at: it sparsifies the relation search space by scoring the relationness of each object pair (i.e., a confidence score for the existence of a relation between the two objects). 2) TSPN tells when to look: it leverages the full video context to simultaneously predict the temporal spans and categories of all relations. TSPN demonstrates its effectiveness by achieving a new state of the art by a significant margin on two VidVRD benchmarks (ImageNet-VidVRD and VidOR), while also showing lower time complexity than existing methods; in particular, it is twice as efficient as a popular segment-based approach.
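To make point 1) concrete, the following is a minimal sketch of how a relationness score could be used to sparsify the search space over object pairs. It is not the authors' implementation: the abstract does not say how relationness is computed or how pairs are selected, so the function name `sparsify_pairs`, the `keep_ratio` parameter, and the hand-written score matrix are all hypothetical and serve only as illustration.

```python
from itertools import permutations


def sparsify_pairs(relationness, keep_ratio=0.5):
    """Keep only the highest-scoring ordered (subject, object) pairs.

    relationness: square list of lists, where relationness[s][o] is an
        assumed confidence that some relation exists from object s to
        object o. In a real system a learned model would produce these
        scores; here they are supplied by hand.
    keep_ratio: hypothetical fraction of candidate pairs to keep.
    """
    n = len(relationness)
    # All ordered pairs of distinct objects: n * (n - 1) candidates.
    candidates = list(permutations(range(n), 2))
    # Rank candidates from most to least likely to be related.
    candidates.sort(key=lambda p: relationness[p[0]][p[1]], reverse=True)
    # Keep at least one pair whenever any candidate exists.
    k = max(1, int(len(candidates) * keep_ratio))
    return candidates[:k]


# Three detected objects give six ordered candidate pairs.
scores = [
    [0.00, 0.90, 0.10],
    [0.20, 0.00, 0.80],
    [0.05, 0.30, 0.00],
]
kept = sparsify_pairs(scores, keep_ratio=0.5)
print(kept)  # [(0, 1), (1, 2), (2, 1)]
```

Only the kept pairs would then be passed on to the more expensive stage that predicts temporal spans and relation categories, which is where the reduction in search space would translate into lower cost.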