DnS: Distill-and-Select for Efficient and Accurate Video Indexing and Retrieval

In this paper, we address the problem of high-performance and computationally efficient content-based video retrieval in large-scale datasets. Current methods typically propose either: (i) fine-grained approaches that employ spatio-temporal representations and similarity calculations, achieving high performance at a high computational cost, or (ii) coarse-grained approaches that represent/index videos as global vectors, where the spatio-temporal structure is lost, providing low performance at a low computational cost. In this work, we propose a Knowledge Distillation framework, called Distill-and-Select (DnS), that, starting from a well-performing fine-grained Teacher Network, learns: a) Student Networks at different trade-offs between retrieval performance and computational efficiency and b) a Selector Network that at test time rapidly directs samples to the appropriate student, so as to maintain both high retrieval performance and high computational efficiency. We train several students with different architectures and arrive at different trade-offs between performance and efficiency, i.e., speed and storage requirements, including fine-grained students that store/index videos using binary representations. Importantly, the proposed scheme allows Knowledge Distillation on large, unlabelled datasets, which leads to good students. We evaluate DnS on five public datasets and three different video retrieval tasks and demonstrate a) that our students achieve state-of-the-art performance in several cases and b) that the DnS framework provides an excellent trade-off between retrieval performance, computational speed, and storage space. In specific configurations, the proposed method achieves mAP similar to that of the teacher while being 20 times faster and requiring 240 times less storage space. The collected dataset and implementation are publicly available: https://github.com/mever-team/distill-and-select.
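The test-time routing described in the abstract can be illustrated with a small sketch. It is not the released implementation: the function names (`coarse_sim`, `fine_sim`, `selector_score`, `dns_retrieve`), the `budget` parameter, and the toy frame vectors are all assumptions made for illustration. In particular, the Selector Network in the paper is learned, whereas here a hand-written heuristic stands in for it, and the two similarity functions are generic placeholders for a coarse-grained and a fine-grained student.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def coarse_sim(q, v):
    """Placeholder coarse student: mean-pool frames into one global vector."""
    def pool(frames):
        return [sum(col) / len(frames) for col in zip(*frames)]
    return cosine(pool(q), pool(v))


def fine_sim(q, v):
    """Placeholder fine student: frame-to-frame matching (Chamfer-style)."""
    return sum(max(cosine(f, g) for g in v) for f in q) / len(q)


def selector_score(coarse_score):
    """Hypothetical stand-in for the learned selector.

    Scores are highest when the coarse similarity is near 0.5, i.e. when
    the cheap student is least decisive.
    """
    return 1.0 - 2.0 * abs(coarse_score - 0.5)


def dns_retrieve(query, database, budget):
    """Score all videos coarsely, then re-score a fraction with the fine student.

    budget is the assumed fraction of database videos sent to the fine student.
    Returns the final scores and the sorted indices that were re-scored.
    """
    coarse = [coarse_sim(query, v) for v in database]
    need = [selector_score(s) for s in coarse]
    k = int(round(budget * len(database)))
    order = sorted(range(len(database)), key=lambda i: need[i], reverse=True)
    chosen = sorted(order[:k])
    scores = list(coarse)
    for i in chosen:
        scores[i] = fine_sim(query, database[i])
    return scores, chosen


# Toy data: each video is a list of 2-D frame descriptors.
query = [[1.0, 0.0], [0.0, 1.0]]
database = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]], [[-1.0, 0.0]]]
scores, rescored = dns_retrieve(query, database, budget=1 / 3)
```

With `budget=0.0` the sketch reduces to the coarse student alone, and with `budget=1.0` every video is scored by the fine student. Intermediate values trade speed against accuracy, which is the kind of trade-off the abstract attributes to the selector.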
