Modern video-text retrieval frameworks typically consist of three parts: a video encoder, a text encoder, and a similarity head. Following the success of transformers in both visual and textual representation learning, transformer-based encoders and fusion methods have also been adopted in video-text retrieval. In this report, we present CLIP2TV, aiming to explore where the critical elements lie in transformer-based methods. To achieve this, we first revisit some recent works on multi-modal learning, then introduce some techniques into video-text retrieval, and finally evaluate them through extensive experiments under different configurations. Notably, CLIP2TV achieves 52.9@R1 on the MSR-VTT dataset, outperforming the previous state-of-the-art result by 4.1%.
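As a rough illustration of the three-part structure mentioned above, the minimal PyTorch sketch below pairs two placeholder encoders with a cosine-similarity head. It is not CLIP2TV's actual architecture: the module names, dimensions, linear encoders, and mean-pooling over frames are all assumptions made for the example.

```python
# Minimal sketch of a video-text retrieval framework: video encoder,
# text encoder, and a similarity head. All internals are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoTextRetriever(nn.Module):
    def __init__(self, video_dim=512, text_dim=512, embed_dim=256):
        super().__init__()
        # Video encoder (placeholder): projects per-frame features
        # into the shared embedding space.
        self.video_encoder = nn.Linear(video_dim, embed_dim)
        # Text encoder (placeholder): projects a sentence feature
        # into the same space.
        self.text_encoder = nn.Linear(text_dim, embed_dim)

    def forward(self, frame_feats, text_feats):
        # frame_feats: (B, num_frames, video_dim); mean-pool frames
        # into one clip-level embedding per video (an assumption here).
        v = self.video_encoder(frame_feats).mean(dim=1)
        t = self.text_encoder(text_feats)
        # Similarity head: cosine similarity over all video-text pairs.
        v = F.normalize(v, dim=-1)
        t = F.normalize(t, dim=-1)
        return v @ t.T  # (num_videos, num_texts) similarity matrix

sim = VideoTextRetriever()(torch.randn(4, 12, 512), torch.randn(4, 512))
print(sim.shape)  # torch.Size([4, 4])
```

In real systems the two encoders would be transformer backbones (e.g., CLIP-initialized), and retrieval metrics such as R@1 are computed by ranking each text against all videos (and vice versa) in this similarity matrix.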