Self-supervised Video Retrieval Transformer Network

Content-based video retrieval aims to find videos in a large video database that are similar to, or even near-duplicates of, a given query video. Video representation and similarity search algorithms are crucial to any video retrieval system. To derive an effective video representation, most video retrieval systems require a large amount of manually annotated training data, which makes them costly and inefficient. In addition, most retrieval systems rely on frame-level features for video similarity search, which is expensive in both storage and search. We propose a novel video retrieval system, termed SVRTN, that addresses these shortcomings. It first applies self-supervised training to learn video representations from unlabeled data, avoiding the expensive cost of manual annotation. It then uses a transformer structure to aggregate frame-level features into clip-level features, reducing both storage space and search complexity. The model learns complementary and discriminative information from the interactions among clip frames, and it acquires invariance to frame permutation and to missing frames, which supports more flexible retrieval modes. Comprehensive experiments on two challenging video retrieval datasets, FIVR-200K and SVD, verify the effectiveness of the proposed SVRTN method, which achieves the best video retrieval performance in both accuracy and efficiency.
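The aggregation step described in the abstract can be illustrated with a minimal sketch. It is not the SVRTN architecture: it assumes a single self-attention layer with random, untrained weights, no positional encoding, and mean pooling, and the names `aggregate_clip`, `w_q`, `w_k`, and `w_v` are hypothetical. It shows only how several frame-level vectors can be reduced to one clip-level vector, and why that vector does not change when the frames are reordered. The abstract does not say how SVRTN obtains its invariance to missing frames, so the sketch does not attempt it.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def aggregate_clip(frames, w_q, w_k, w_v):
    """Reduce frame-level features of shape (n_frames, d) to one clip vector.

    Assumption: one self-attention layer without positional encoding,
    followed by mean pooling and L2 normalisation.
    """
    q = frames @ w_q
    k = frames @ w_k
    v = frames @ w_v
    # Each frame attends to every frame in the clip.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    attended = softmax(scores, axis=-1) @ v
    # Mean pooling discards frame order, so the result is order-invariant.
    clip = attended.mean(axis=0)
    return clip / np.linalg.norm(clip)


rng = np.random.default_rng(0)
d = 16          # feature dimension (illustrative value)
n_frames = 8    # frames per clip (illustrative value)

# Random projection matrices stand in for trained parameters.
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

frames = rng.standard_normal((n_frames, d))
clip_vec = aggregate_clip(frames, w_q, w_k, w_v)

# The same frames in a different order give the same clip-level vector.
shuffled = frames[rng.permutation(n_frames)]
clip_vec_shuffled = aggregate_clip(shuffled, w_q, w_k, w_v)
```

With these illustrative sizes, the index stores one 16-dimensional vector per clip instead of eight per-frame vectors, which is the kind of storage and search reduction the abstract refers to.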
