Recent advances in multimodal vision and language modeling have predominantly focused on the English language, mostly due to the lack of multilingual multimodal datasets to steer modeling efforts. In this work, we address this gap and provide xGQA, a new multilingual evaluation benchmark for the visual question answering task. We extend the established English GQA dataset to 7 typologically diverse languages, enabling us to detect and explore crucial challenges in cross-lingual visual question answering. We further propose new adapter-based approaches to adapt multimodal transformer-based models to become multilingual, and -- vice versa -- multilingual models to become multimodal. Our proposed methods outperform current state-of-the-art multilingual multimodal models (e.g., M3P) in zero-shot cross-lingual settings, but the accuracy remains low across the board; a performance drop of around 38 accuracy points in target languages showcases the difficulty of zero-shot cross-lingual transfer for this task. Our results suggest that simple cross-lingual transfer of multimodal models yields latent multilingual multimodal misalignment, calling for more sophisticated methods for vision and multilingual language modeling. The xGQA dataset is available online at: https://github.com/Adapter-Hub/xGQA.