While multilingual vision-language pretrained models have brought several benefits, recent benchmarks across various tasks and languages have shown poor cross-lingual generalisation when these models are applied to non-English data, with a large gap between (supervised) English performance and (zero-shot) cross-lingual transfer. In this work, we address the poor performance of these models on a zero-shot cross-lingual visual question answering (VQA) task, where models are fine-tuned on English visual question answering data and evaluated on 7 typologically diverse languages. We improve cross-lingual transfer with three strategies: (1) we introduce a linguistic prior objective that augments the cross-entropy loss with a similarity-based loss to guide the model during training, (2) we learn a task-specific subnetwork that improves cross-lingual generalisation and reduces variance without modifying the model, and (3) we augment training examples with synthetic code-mixing to promote alignment of embeddings between source and target languages. Our experiments on xGQA with the pretrained multilingual multimodal transformers UC2 and M3P demonstrate the consistent effectiveness of the proposed fine-tuning strategy across 7 languages, outperforming existing transfer methods with sparse models. Code and data to reproduce our findings are publicly available.
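As a rough illustration only, not the paper's actual implementation, the first and third strategies can be sketched as below. The function names, the cosine-similarity choice, the weighting factor `lam`, and the dictionary-based replacement scheme are assumptions made for this sketch, since the abstract does not specify them.

```python
import random

import torch.nn.functional as F


def similarity_augmented_loss(logits, labels, src_emb, tgt_emb, lam=0.1):
    """Hypothetical sketch: augment the task cross-entropy with a
    similarity-based term so paired source/target representations are
    pulled closer together during fine-tuning."""
    ce = F.cross_entropy(logits, labels)
    # 1 - cosine similarity: small when the paired embeddings align.
    sim = 1.0 - F.cosine_similarity(src_emb, tgt_emb, dim=-1).mean()
    return ce + lam * sim


def code_mix(tokens, bilingual_dict, p=0.3, rng=None):
    """Hypothetical sketch of synthetic code-mixing: each source-language
    token is replaced by a dictionary translation with probability p."""
    rng = rng or random.Random(0)
    return [
        rng.choice(bilingual_dict[t]) if t in bilingual_dict and rng.random() < p else t
        for t in tokens
    ]
```

In this sketch the auxiliary term is simply added to the cross-entropy with a fixed weight; the actual objective, similarity function, and code-mixing procedure are those defined in the paper.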