Commonsense is defined as the knowledge that is shared by everyone. However, certain types of commonsense knowledge are correlated with culture and geographic location and are shared only locally. For example, wedding ceremonies vary across regions because of customs shaped by historical and religious factors. Such regional characteristics, however, are generally omitted in prior work. In this paper, we construct a Geo-Diverse Visual Commonsense Reasoning dataset (GD-VCR) to test vision-and-language models' ability to understand cultural and geo-location-specific commonsense. In particular, we study two state-of-the-art vision-and-language models, VisualBERT and ViLBERT, trained on VCR, a standard multimodal commonsense benchmark with images primarily from Western regions. We then evaluate how well the trained models generalize to answering the questions in GD-VCR. We find that the performance of both models on non-Western regions, including East Asia, South Asia, and Africa, is significantly lower than on the Western region. We analyze the reasons behind this disparity and find that the performance gap is larger on QA pairs that: 1) concern culture-related scenarios, e.g., weddings, religious activities, and festivals; 2) require high-level geo-diverse commonsense reasoning rather than low-order perception and recognition. Dataset and code are released at https://github.com/WadeYin9712/GD-VCR.