Most vision-and-language pretraining research focuses on English tasks. However, the creation of multilingual multimodal evaluation datasets (e.g. Multi30K, xGQA, XVNLI, and MaRVL) poses a new challenge: finding high-quality training data that is both multilingual and multimodal. In this paper, we investigate whether machine-translating English multimodal data can be an effective proxy for the lack of readily available multilingual data. We call this framework TD-MML: Translated Data for Multilingual Multimodal Learning, and it can be applied to any multimodal dataset and model. We apply it to both pretraining and fine-tuning data with a state-of-the-art model. To prevent models from learning from low-quality translated text, we propose two metrics for automatically removing such translations from the resulting datasets. In experiments on five tasks across 20 languages in the IGLUE benchmark, we show that translated data can provide a useful signal for multilingual multimodal learning, both at the pretraining and fine-tuning stages.