Pre-trained models with dual and cross encoders have shown remarkable success in advancing several vision-and-language tasks, including Visual Question Answering (VQA). However, because they depend on gold-annotated data, most of these advances never reach languages beyond English. We address this problem by introducing a curriculum based on source- and target-language translations to finetune pre-trained models for the downstream task. Experimental results demonstrate that script plays a vital role in the performance of these models. Specifically, we show that target languages sharing the same script as the source perform better (~6%) than other languages, and that mixed-script code-switched languages outperform their single-script counterparts (~5-12%).