Reliable evaluation benchmarks designed for replicability and comprehensiveness have driven progress in machine learning. Due to the lack of a multilingual benchmark, however, vision-and-language research has mostly focused on English language tasks. To fill this gap, we introduce the Image-Grounded Language Understanding Evaluation (IGLUE) benchmark. IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages, both by aggregating pre-existing datasets and by creating new ones. Our benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in newly defined few-shot learning setups. Based on an evaluation of the available state-of-the-art models, we find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks. Moreover, downstream performance is partially explained by the amount of available unlabelled textual data for pretraining, and only weakly by the typological distance between target and source languages. We hope to encourage future research efforts in this area by releasing the benchmark to the community.
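To make the two transfer setups mentioned above concrete, the sketch below outlines zero-shot versus few-shot cross-lingual evaluation as generic protocols. This is a minimal illustration under our own naming assumptions: the `fine_tune` and `evaluate` callables and all dataset variables are hypothetical placeholders, not the IGLUE codebase API.

```python
# Hypothetical sketch of zero-shot vs. few-shot cross-lingual transfer evaluation.
# All helper names (fine_tune, evaluate) and data structures are placeholders.
from typing import Callable, Dict, List


def zero_shot_transfer(model,
                       fine_tune: Callable,
                       evaluate: Callable,
                       english_train: List[dict],
                       target_test: Dict[str, List[dict]]) -> Dict[str, float]:
    """Fine-tune on English task data only, then evaluate directly on each target language."""
    tuned = fine_tune(model, english_train)
    return {lang: evaluate(tuned, examples) for lang, examples in target_test.items()}


def few_shot_transfer(model,
                      fine_tune: Callable,
                      evaluate: Callable,
                      english_train: List[dict],
                      target_shots: Dict[str, List[dict]],
                      target_test: Dict[str, List[dict]]) -> Dict[str, float]:
    """Fine-tune on English task data, then further adapt on a small number of
    labelled target-language examples before evaluating in that language."""
    base = fine_tune(model, english_train)
    scores = {}
    for lang, test_examples in target_test.items():
        # In practice, adaptation would restart from the English-tuned checkpoint
        # for each language rather than reuse a single mutated model.
        adapted = fine_tune(base, target_shots.get(lang, []))
        scores[lang] = evaluate(adapted, test_examples)
    return scores
```

The translate-test baseline referred to in the abstract differs from both protocols: the target-language test data is machine-translated into English and scored with the English-only fine-tuned model, so no target-language labels or adaptation are involved.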