Recently, large-scale pre-training methods like CLIP have made great progress in multi-modal research such as text-video retrieval. In CLIP, transformers are vital for modeling complex multi-modal relations. However, in the vision transformer of CLIP, the essential visual tokenization process, which produces discrete visual token sequences, generates many homogeneous tokens due to the redundant nature of consecutive and similar frames in videos. This significantly increases computation costs and hinders the deployment of video retrieval models in web applications. In this paper, to reduce the number of redundant video tokens, we design a multi-segment token clustering algorithm that finds the most representative tokens and drops the non-essential ones. Since frame redundancy occurs mostly in consecutive frames, we divide videos into multiple segments and conduct segment-level clustering. Center tokens from each segment are then concatenated into a new sequence, while their original spatio-temporal relations are well preserved. We instantiate two clustering algorithms to efficiently find deterministic medoids and to iteratively partition groups in high-dimensional space. Through this token clustering and center selection procedure, we successfully reduce computation costs by removing redundant visual tokens. This method further enhances segment-level semantic alignment between video and text representations, enforcing the spatio-temporal interactions of tokens from within-segment frames. Our method, coined CenterCLIP, surpasses the existing state-of-the-art by a large margin on typical text-video benchmarks, while reducing the training memory cost by 35\% and accelerating the inference speed by 14\% in the best case. The code is available at \href{https://github.com/mzhaoshuai/CenterCLIP}{https://github.com/mzhaoshuai/CenterCLIP}.
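To make the segment-level center-selection step concrete, the following is a minimal sketch assuming PyTorch. The function name \texttt{segment\_medoid\_tokens}, the segment count, and the per-segment token budget are illustrative assumptions, not the paper's hyper-parameters; for brevity, the paper's two clustering algorithms (deterministic medoid finding and iterative partitioning) are replaced here by a single deterministic medoid score, so this is a simplified stand-in rather than the authors' implementation.

\begin{verbatim}
# Sketch: split visual tokens into temporal segments, keep only the
# most central (medoid-like) tokens per segment, and concatenate the
# survivors in their original order. Illustrative, not the exact method.
import torch


def segment_medoid_tokens(tokens: torch.Tensor, num_segments: int,
                          medoids_per_segment: int) -> torch.Tensor:
    # tokens: (seq_len, dim) visual tokens from consecutive frames.
    # Returns: (num_segments * medoids_per_segment, dim) reduced sequence.
    kept = []
    for segment in tokens.chunk(num_segments, dim=0):
        # Pairwise distances within the segment; a medoid minimizes
        # its total distance to the other tokens (deterministic choice).
        dist = torch.cdist(segment, segment)           # (s, s)
        scores = dist.sum(dim=1)                       # centrality per token
        idx = scores.argsort()[:medoids_per_segment]   # most central tokens
        kept.append(segment[idx.sort().values])        # keep temporal order
    return torch.cat(kept, dim=0)


if __name__ == "__main__":
    video_tokens = torch.randn(48, 512)  # e.g. 48 patch tokens, dim 512
    reduced = segment_medoid_tokens(video_tokens, num_segments=4,
                                    medoids_per_segment=3)
    print(reduced.shape)                 # torch.Size([12, 512])
\end{verbatim}

Because clustering is confined to each segment, selected center tokens never cross segment boundaries, which is what preserves the original spatio-temporal ordering when the per-segment survivors are concatenated.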