Visual question answering (VQA) is one of the crucial vision-and-language tasks. Yet, until recently, the bulk of research has focused only on English, owing to a lack of appropriate evaluation resources. Previous work on cross-lingual VQA has reported poor zero-shot transfer performance of current multilingual multimodal Transformers and large gaps to monolingual performance, attributing these mostly to misalignment of text embeddings between the source and target languages, without providing deeper analyses. In this work, we address different aspects of cross-lingual VQA holistically, aiming to understand the impact of input data, fine-tuning and evaluation regimes, and interactions between the two modalities in cross-lingual setups. 1) We tackle low transfer performance via novel methods that substantially reduce the gap to monolingual English performance, yielding +10 accuracy points over existing transfer methods. 2) We study and dissect cross-lingual VQA across question types of varying complexity, across different multilingual multimodal Transformers, and in zero-shot and few-shot scenarios. 3) We conduct extensive analyses of modality biases in training data and models, aiming to understand why zero-shot performance gaps remain for some question types and languages. We hope that the novel methods and detailed analyses will guide further progress in multilingual VQA.