Hybrid Contrastive Quantization for Efficient Cross-View Video Retrieval

With the recent boom of video-based social platforms (e.g., YouTube and TikTok), video retrieval using sentence queries has become an important demand and attracts increasing research attention. Despite their decent performance, existing text-video retrieval models from the vision and language communities are impractical for large-scale Web search because they rely on brute-force search over high-dimensional embeddings. To improve efficiency, Web search engines widely apply vector compression libraries (e.g., FAISS) to post-process the learned embeddings. Unfortunately, separating compression from feature encoding degrades the robustness of the representations and incurs performance decay. To pursue a better balance between performance and efficiency, we propose the first quantized representation learning method for cross-view video retrieval, namely Hybrid Contrastive Quantization (HCQ). Specifically, HCQ learns both coarse-grained and fine-grained quantizations with transformers, which provide complementary understandings of texts and videos and preserve comprehensive semantic information. By performing Asymmetric-Quantized Contrastive Learning (AQ-CL) across views, HCQ aligns texts and videos at the coarse-grained level and at multiple fine-grained levels. This hybrid-grained learning strategy serves as strong supervision for the cross-view video quantization model, where contrastive learning at the different levels is mutually reinforcing. Extensive experiments on three Web video benchmark datasets demonstrate that HCQ achieves performance competitive with state-of-the-art non-compressed retrieval methods while showing high efficiency in storage and computation. Code and configurations are available at https://github.com/gimpong/WWW22-HCQ.
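
To make the efficiency argument concrete, the following is a minimal sketch of generic product quantization with asymmetric distance computation, where the query stays uncompressed and the stored items are compared through their compact codes. It is not the HCQ implementation: the codebooks here are random rather than learned, and every name and size (`DIM`, `M`, `K`, `encode`, `adc_table`, `adc_distance`) is a hypothetical choice made for illustration only.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def split(vec, m):
    """Split a vector into m contiguous sub-vectors of equal length."""
    step = len(vec) // m
    return [vec[i * step:(i + 1) * step] for i in range(m)]


def encode(vec, codebooks):
    """Quantize: per sub-vector, keep only the index of the nearest codeword."""
    subs = split(vec, len(codebooks))
    return [min(range(len(cb)), key=lambda k: sq_dist(sub, cb[k]))
            for sub, cb in zip(subs, codebooks)]


def decode(codes, codebooks):
    """Reconstruct the approximate vector from its codeword indices."""
    out = []
    for code, cb in zip(codes, codebooks):
        out.extend(cb[code])
    return out


def adc_table(query, codebooks):
    """Distances from each raw query sub-vector to every codeword."""
    subs = split(query, len(codebooks))
    return [[sq_dist(sub, cw) for cw in cb] for sub, cb in zip(subs, codebooks)]


def adc_distance(table, codes):
    """Asymmetric distance: M table lookups instead of a DIM-length scan."""
    return sum(table[j][code] for j, code in enumerate(codes))


# Assumed toy sizes: 8-dim embeddings, 4 sub-spaces, 4 codewords each.
DIM, M, K = 8, 4, 4
rng = random.Random(0)

# Random codebooks stand in for the quantizers a real model would learn.
codebooks = [[[rng.gauss(0, 1) for _ in range(DIM // M)] for _ in range(K)]
             for _ in range(M)]

# "Video" embeddings are stored only as M small integers each.
videos = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]
video_codes = [encode(v, codebooks) for v in videos]

# The "text" query is kept at full precision (the asymmetric part).
query = [rng.gauss(0, 1) for _ in range(DIM)]
table = adc_table(query, codebooks)
scores = [adc_distance(table, codes) for codes in video_codes]
ranking = sorted(range(len(videos)), key=scores.__getitem__)
print(ranking)
```

In this sketch each stored item costs `M` small integers rather than `DIM` floats, and scoring it costs `M` table lookups. Because squared Euclidean distance decomposes over sub-vectors, the looked-up score equals the exact distance between the raw query and the reconstructed item, so the only approximation error comes from quantizing the stored side.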
