The design of widespread vision-and-language datasets and pre-trained encoders directly adopts, or draws inspiration from, the concepts and images of ImageNet. While one can hardly overestimate how much this benchmark contributed to progress in computer vision, it is mostly derived from lexical databases and image queries in English, resulting in source material with a North American or Western European bias. Therefore, we devise a new protocol to construct an ImageNet-style hierarchy representative of more languages and cultures. In particular, we let the selection of both concepts and images be entirely driven by native speakers, rather than scraping them automatically. Specifically, we focus on a typologically diverse set of languages, namely, Indonesian, Mandarin Chinese, Swahili, Tamil, and Turkish. On top of the concepts and images obtained through this new protocol, we create a multilingual dataset for {M}ulticultur{a}l {R}easoning over {V}ision and {L}anguage (MaRVL) by eliciting statements from native speaker annotators about pairs of images. The task consists of discriminating whether each grounded statement is true or false. We establish a series of baselines using state-of-the-art models and find that their cross-lingual transfer performance lags dramatically behind supervised performance in English. These results invite us to reassess the robustness and accuracy of current state-of-the-art models beyond a narrow domain, but also open up new exciting challenges for the development of truly multilingual and multicultural systems.
翻译:设计广泛的视觉和语言数据集和经过预先训练的多语言类集,直接采用图像网络的概念和图像,或从这些概念和图像中得到启发。虽然人们几乎无法高估这一基准对计算机愿景的进展贡献多少,但大部分来自英文的词汇数据库和图像查询,从而产生了北美或西欧偏差的原始材料。因此,我们设计了一个新的程序,以构建一个代表更多语言和文化的图像网络式的层次结构。特别是,我们让本地语言者完全驱动概念和图像的选择,而不是自动筛选这些概念和图像。具体地说,我们侧重于一组典型的多种语言,即印度尼西亚语、汉语中文、斯瓦希里语、泰米尔语和土耳其语。在通过这一新协议获得的概念和图像的顶端,我们为{multiulturtur{a}}l{R}{R}scouncounity sets 代表更多的语言和文化。我们让本地语言和语言系统(MaRVL)通过从本地演讲者那里获得关于当前图像组合的说明。我们的任务包括了真实的准确性,我们用每套标准转换的成绩来建立真实性基线,我们用真实性基准来确定真实性,或者快速的英文级模型。