Video retrieval has seen tremendous progress with the development of vision-language models. However, further improving these models requires additional labeled data, which demands substantial manual effort. In this paper, we propose MKTVR, a framework that utilizes knowledge transfer from a multilingual model to boost the performance of video retrieval. We first use state-of-the-art machine translation models to construct pseudo ground-truth multilingual video-text pairs. We then use this data to learn a video-text representation in which English and non-English text queries are mapped to a common embedding space based on pretrained multilingual models. We evaluate our proposed approach on four English video retrieval datasets, namely MSRVTT, MSVD, DiDeMo, and Charades. Experimental results demonstrate that our approach achieves state-of-the-art results on all datasets, outperforming previous models. Finally, we also evaluate our model on a multilingual video retrieval dataset encompassing six languages and show that our model outperforms previous multilingual video retrieval models in a zero-shot setting.
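To make the core idea concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of aligning video features with multilingual text features in a shared embedding space using a symmetric contrastive loss. The encoder dimensions, the projector class, and the use of machine-translated captions as an extra positive view are assumptions for illustration only.

```python
# Illustrative sketch: project video and text features into a common space and
# train with a symmetric InfoNCE loss over both English and machine-translated
# captions. All module names and dimensions are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedSpaceProjector(nn.Module):
    """Projects video and text features into a common embedding space."""
    def __init__(self, video_dim=512, text_dim=768, embed_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        # Learnable temperature, as in CLIP-style contrastive training.
        self.logit_scale = nn.Parameter(torch.tensor(1.0 / 0.07).log())

    def forward(self, video_feats, text_feats):
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v, t

def contrastive_loss(v, t, logit_scale):
    """Symmetric video-to-text and text-to-video InfoNCE loss."""
    logits = logit_scale.exp() * v @ t.t()     # (B, B) similarity matrix
    targets = torch.arange(v.size(0))          # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    # Dummy batch: video features from a video backbone, text features from a
    # (hypothetical) pretrained multilingual text encoder applied to both the
    # original English caption and its machine-translated counterpart.
    batch = 8
    video_feats = torch.randn(batch, 512)
    text_feats_en = torch.randn(batch, 768)    # English captions
    text_feats_mt = torch.randn(batch, 768)    # machine-translated captions

    model = SharedSpaceProjector()
    loss = 0.0
    for text_feats in (text_feats_en, text_feats_mt):
        v, t = model(video_feats, text_feats)
        loss = loss + contrastive_loss(v, t, model.logit_scale)
    print("total loss:", loss.item())
```

Because the text encoder is multilingual, the same projection head can serve English and non-English queries at retrieval time, which is what enables the zero-shot multilingual evaluation described above.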