Video retrieval is becoming increasingly important owing to the rapid growth of video content on the Internet. The dominant paradigm for video retrieval learns video-text representations by pushing the similarity of positive pairs and that of negative pairs apart by a fixed margin. However, the negative pairs used for training are sampled randomly, so the semantics of a negative pair may be related or even equivalent to those of the positive pair, yet most methods still force their representations apart to decrease their similarity. This phenomenon leads to inaccurate supervision and poor performance in learning video-text representations. While most video retrieval methods overlook this phenomenon, we propose an adaptive margin that varies with the distance between positive and negative pairs to address the issue. First, we design a framework for computing the adaptive margin, including the distance measure and the function mapping that distance to the margin. Then, we explore a novel implementation called "Cross-Modal Generalized Self-Distillation" (CMGSD), which can be built on top of most video retrieval models with few modifications. Notably, CMGSD adds little computational overhead at training time and none at test time. Experimental results on three widely used datasets demonstrate that the proposed method yields significantly better performance than the corresponding backbone model and outperforms state-of-the-art methods by a large margin.
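To make the core idea concrete, the sketch below shows a bidirectional max-margin ranking loss whose margin shrinks for negatives that appear semantically related to the positive pair. This is a minimal illustration under stated assumptions, not the paper's exact CMGSD loss: in particular, the within-modality similarity used as a relatedness proxy and the linear mapping from relatedness to margin are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def adaptive_margin_ranking_loss(video_emb, text_emb, base_margin=0.2):
    """Bidirectional max-margin ranking loss with an adaptive margin.

    A hedged sketch of the adaptive-margin idea: the margin for each
    negative pair shrinks when the negative looks semantically related
    to the positive, so related pairs are not forced apart as hard.
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = v @ t.t()                     # sim[i, j]: video i vs. text j
    pos = sim.diag()                    # positive-pair similarities
    mask = 1.0 - torch.eye(sim.size(0), device=sim.device)

    # Illustrative relatedness proxies (assumptions, not the paper's
    # distance measure): within-modality similarity between the
    # negative item and the item forming the positive pair.
    tt = (t @ t.t()).clamp(min=0.0)     # text-text relatedness
    vv = (v @ v.t()).clamp(min=0.0)     # video-video relatedness

    # Adaptive margin: related negatives (relatedness near 1) get a
    # margin near 0; unrelated ones keep the full base margin.
    m_v2t = base_margin * (1.0 - tt)
    m_t2v = base_margin * (1.0 - vv)

    # Hinge losses over both retrieval directions.
    loss_v2t = ((m_v2t + sim - pos.unsqueeze(1)).clamp(min=0.0) * mask).sum()
    loss_t2v = ((m_t2v + sim - pos.unsqueeze(0)).clamp(min=0.0) * mask).sum()
    return (loss_v2t + loss_t2v) / sim.size(0)
```

Per the abstract, the full method instead realizes the adaptive margin through cross-modal generalized self-distillation on top of an existing retrieval backbone; the fixed proxy above only illustrates the shape of the mechanism, which is why it adds little training-time overhead and leaves the test-time scoring path unchanged.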