In recent years, considerable progress on the task of text-video retrieval has been achieved by leveraging large-scale pretraining on visual and audio datasets to construct powerful video encoders. By contrast, despite the natural symmetry, the design of effective algorithms for exploiting large-scale language pretraining remains under-explored. In this work, we are the first to investigate the design of such algorithms and propose a novel generalized distillation method, TeachText, which leverages complementary cues from multiple text encoders to provide an enhanced supervisory signal to the retrieval model. Moreover, we extend our method to video-side modalities and show that we can effectively reduce the number of modalities used at test time without compromising performance. Our approach advances the state of the art on several video retrieval benchmarks by a significant margin while adding no computational overhead at test time. Last but not least, we show an effective application of our method for eliminating noise from retrieval datasets. Code and data can be found at https://www.robots.ox.ac.uk/~vgg/research/teachtext/.
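To make the high-level idea concrete, below is a minimal sketch, not the paper's exact objective, of a generalized distillation loss in this spirit: similarity matrices produced by several teacher retrieval models, each built on a different pretrained text encoder, are aggregated into a target that supervises the student alongside a standard retrieval loss. The mean aggregation, the MSE distillation distance, the bidirectional max-margin retrieval term, and all function names here are assumptions of the sketch rather than the published implementation.

```python
import torch
import torch.nn.functional as F


def max_margin_loss(sim: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """Bidirectional max-margin ranking loss over a batch of matched
    caption-video pairs; diagonal entries of `sim` are the positives."""
    pos = sim.diag().unsqueeze(1)                       # (B, 1) positive scores
    cost_t2v = (margin + sim - pos).clamp(min=0)        # text -> video violations
    cost_v2t = (margin + sim - pos.t()).clamp(min=0)    # video -> text violations
    eye = torch.eye(sim.size(0), device=sim.device, dtype=torch.bool)
    cost_t2v = cost_t2v.masked_fill(eye, 0)             # ignore the positives themselves
    cost_v2t = cost_v2t.masked_fill(eye, 0)
    return (cost_t2v.sum() + cost_v2t.sum()) / sim.size(0)


def teachtext_style_loss(student_sim: torch.Tensor,
                         teacher_sims: list[torch.Tensor],
                         distill_weight: float = 1.0) -> torch.Tensor:
    """Hypothetical training objective: the student's (B, B) text-video
    similarity matrix is pushed toward an aggregate of the teachers'
    similarity matrices, on top of the usual retrieval loss."""
    # Aggregate the complementary teacher cues (simple mean, as an assumption).
    target = torch.stack(teacher_sims).mean(dim=0).detach()
    distill = F.mse_loss(student_sim, target)
    return max_margin_loss(student_sim) + distill_weight * distill


# Toy usage with random similarity matrices for a batch of 8 pairs.
B = 8
student_sim = torch.randn(B, B, requires_grad=True)
teacher_sims = [torch.randn(B, B) for _ in range(3)]
loss = teachtext_style_loss(student_sim, teacher_sims)
loss.backward()
```

Note that the teachers are only consulted during training: at test time the student retrieves on its own, which is consistent with the abstract's claim that the method adds no computational overhead at inference.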