In this paper, we focus on the video relocalization task, which takes a query video clip as input and retrieves a semantically related clip from another untrimmed long video. We observe that in video relocalization datasets there is no consistent relationship between frame-level feature similarity and video-level feature similarity, which hampers feature fusion across frames; existing video relocalization methods do not fully account for this phenomenon. Motivated by this observation, we treat the video features as a graph: the query video feature and the proposal video feature are concatenated along the time dimension, each timestep is treated as a node, and each row of the feature matrix is the feature of that node. We then leverage graph neural networks and propose a Multi-Graph Feature Fusion Module to fuse the relational features of this graph. Evaluations on the ActivityNet v1.2 and Thumos14 datasets show that our method outperforms state-of-the-art methods.
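The graph construction described above can be made concrete with a minimal sketch. The snippet below assumes frame-level feature matrices of shape (T, D) and illustrates concatenating query and proposal features along the time axis so that each timestep becomes a node; the cosine-similarity adjacency, the GCN-style update, and all names (`build_fusion_graph`, `similarity_adjacency`, `GraphFusionLayer`) are illustrative assumptions, not the paper's actual Multi-Graph Feature Fusion Module.

```python
import torch

def build_fusion_graph(query_feat, proposal_feat):
    """Concatenate query and proposal features along the time axis.

    query_feat:    (T_q, D) frame-level features of the query clip
    proposal_feat: (T_p, D) frame-level features of the proposal clip
    Returns node features of shape (T_q + T_p, D): each timestep is
    one node, and each row of the matrix is that node's feature.
    """
    return torch.cat([query_feat, proposal_feat], dim=0)

def similarity_adjacency(nodes):
    """One plausible edge definition (an assumption): cosine similarity
    between node features, so frames can exchange information even when
    frame-level and video-level similarity disagree."""
    normed = torch.nn.functional.normalize(nodes, dim=1)
    return normed @ normed.t()  # (T_q + T_p, T_q + T_p)

class GraphFusionLayer(torch.nn.Module):
    """A single GCN-style message-passing step: relu(softmax(A) X W)."""
    def __init__(self, dim):
        super().__init__()
        self.proj = torch.nn.Linear(dim, dim)

    def forward(self, nodes):
        adj = torch.softmax(similarity_adjacency(nodes), dim=1)
        return torch.relu(self.proj(adj @ nodes))

# Usage: 32 query frames and 64 proposal frames, 512-D features each.
query = torch.randn(32, 512)
proposal = torch.randn(64, 512)
nodes = build_fusion_graph(query, proposal)   # (96, 512) node features
fused = GraphFusionLayer(512)(nodes)          # fused relational features
```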