Visual Question Answering (VQA) has been primarily studied through the lens of the English language. Yet, tackling VQA in other languages in the same manner would require a considerable amount of resources. In this paper, we propose scalable solutions to multi-lingual visual question answering (mVQA), on both the data and modeling fronts. We first propose a translation-based framework for mVQA data generation that requires much less human annotation effort than the conventional approach of directly collecting questions and answers. We then apply this framework to the multi-lingual captions in the Crossmodal-3600 dataset and develop an efficient annotation protocol to create MAVERICS-XM3600 (MaXM), a test-only VQA benchmark in 7 diverse languages. Finally, we propose an approach to unified, extensible, open-ended, and end-to-end mVQA modeling and demonstrate strong performance in 13 languages.
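The translation-based data-generation idea can be sketched at a high level as follows. This is a minimal illustrative sketch, not the paper's exact pipeline: the `qa_from_caption` generator, the `translate` backend, and the overall loop structure are assumptions introduced here only to show how machine translation can replace most of the manual question-and-answer authoring, leaving humans a lighter verification pass.

```python
# Illustrative sketch of a translation-based mVQA data-generation pipeline.
# qa_from_caption and translate are hypothetical placeholders, not the
# components used in the paper.

from dataclasses import dataclass
from typing import Callable, List, Tuple, Dict

@dataclass
class VQAExample:
    image_id: str
    question: str
    answer: str
    language: str

def generate_mvqa_candidates(
    captions: List[Dict[str, str]],                        # [{"image_id": ..., "caption": ...}, ...]
    qa_from_caption: Callable[[str], List[Tuple[str, str]]],  # caption -> [(question, answer), ...]
    translate: Callable[[str, str], str],                   # (text, target_language) -> translated text
    target_languages: List[str],
) -> List[VQAExample]:
    """Produce candidate multilingual QA pairs from image captions.

    QA pairs are generated from captions and then machine-translated into
    each target language; human annotators only verify or filter the
    candidates, which is far cheaper than writing questions and answers
    from scratch in every language.
    """
    candidates: List[VQAExample] = []
    for item in captions:
        for question, answer in qa_from_caption(item["caption"]):
            for lang in target_languages:
                candidates.append(
                    VQAExample(
                        image_id=item["image_id"],
                        question=translate(question, lang),
                        answer=translate(answer, lang),
                        language=lang,
                    )
                )
    return candidates
```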