Video Visual Relation Detection (VidVRD) has received significant attention from the community in recent years. In this paper, we apply the state-of-the-art video object detection pipeline MEGA together with the deepSORT tracker to generate tracklet proposals. We then perform VidVRD in a tracklet-based manner without any pre-cutting operations. Specifically, we design a tracklet-based visual Transformer. It contains a temporal-aware decoder that performs feature interactions between the tracklets and learnable predicate query embeddings, and finally predicts the relations. Experimental results demonstrate the superiority of our method, which outperforms other methods by a large margin on the Video Relation Understanding (VRU) Grand Challenge at ACM Multimedia 2021. Code is released at https://github.com/Dawn-LX/VidVRD-tracklets.
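To make the decoder idea concrete, the following is a minimal numpy sketch of a single cross-attention step in which learnable predicate query embeddings attend over tracklet features, followed by a classification head that maps each query to predicate logits. All dimensions, the query count, and the classifier head here are illustrative assumptions, not the paper's actual configuration; in the real model this would be a multi-layer, multi-head Transformer decoder with temporal-aware components.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64             # feature dimension (assumed for illustration)
num_queries = 8    # number of learnable predicate queries (assumed)
num_tracklets = 5  # tracklet proposals, e.g. from MEGA + deepSORT (placeholder count)
num_predicates = 132  # predicate vocabulary size (132 in the ImageNet-VidVRD dataset)

# Learnable predicate query embeddings and per-tracklet visual features
# (randomly initialized here; in the model both would be trained / extracted).
queries = rng.normal(size=(num_queries, d))
tracklets = rng.normal(size=(num_tracklets, d))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# One cross-attention step: each predicate query attends over all tracklets.
attn = softmax(queries @ tracklets.T / np.sqrt(d))  # (num_queries, num_tracklets)
context = attn @ tracklets                          # (num_queries, d)

# Hypothetical linear head mapping each attended query to predicate logits.
W = rng.normal(size=(d, num_predicates))
logits = context @ W                                # (num_queries, num_predicates)
print(logits.shape)
```

Each row of `attn` is a distribution over tracklets, so a query can softly bind to the subject/object tracklets most relevant to the relation it predicts.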