This paper studies zero-shot cross-lingual transfer of vision-language models. Specifically, we focus on multilingual text-to-video search and propose a Transformer-based model that learns contextualized multilingual multimodal embeddings. Under a zero-shot setting, we empirically demonstrate that performance degrades significantly when we query the multilingual text-video model with non-English sentences. To address this problem, we introduce a multilingual multimodal pre-training strategy and collect a new multilingual instructional video dataset (Multi-HowTo100M) for pre-training. Experiments on VTT show that our method significantly improves video search in non-English languages without additional annotations. Furthermore, when multilingual annotations are available, our method outperforms recent baselines by a large margin in multilingual text-to-video search on VTT and VATEX, as well as in multilingual text-to-image search on Multi30K. Our model and Multi-HowTo100M are available at http://github.com/berniebear/Multi-HT100M.