We propose a methodology and design two benchmark sets for measuring the extent to which vision-and-language models use the visual signal in the presence or absence of stereotypes. The first benchmark tests for stereotypical colors of common objects, while the second considers gender stereotypes. The key idea is to compare a model's predictions when the image conforms to the stereotype with its predictions when it does not. Our results show significant variation among multimodal models: the recent Transformer-based FLAVA appears more sensitive to the choice of image and less affected by stereotypes than older models such as VisualBERT and LXMERT, which rely on CNN-extracted visual features. This effect is more discernible in such a controlled setting than in traditional evaluations, where we cannot tell whether the model relied on the stereotype or on the visual signal.
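The comparison underlying the benchmarks can be illustrated with a minimal sketch. All probabilities, prompts, and function names below are hypothetical and for illustration only (not from the paper or its released benchmarks): we probe a model with the same question under a stereotype-congruent image and a stereotype-incongruent image, and check whether the top prediction tracks what the image actually depicts or stays on the stereotypical answer.

```python
# Hypothetical sketch of the congruent-vs-incongruent comparison.
# The probability dictionaries stand in for a model's answer distribution;
# in a real evaluation they would come from a vision-and-language model.

def top_prediction(distribution):
    """Return the answer with the highest probability."""
    return max(distribution, key=distribution.get)

def uses_visual_signal(congruent_probs, incongruent_probs, stereotype, depicted):
    """True if the model answers with the stereotype under the congruent
    image but switches to the depicted attribute under the incongruent
    image, i.e. its prediction follows the visual signal."""
    return (top_prediction(congruent_probs) == stereotype
            and top_prediction(incongruent_probs) == depicted)

# Example: "What color is the banana?" with a yellow (congruent) vs. a
# blue (incongruent) banana image. Numbers are made up for illustration.
congruent = {"yellow": 0.85, "blue": 0.05, "green": 0.10}
incongruent_visual = {"yellow": 0.20, "blue": 0.70, "green": 0.10}  # follows image
incongruent_stereo = {"yellow": 0.75, "blue": 0.15, "green": 0.10}  # ignores image

print(uses_visual_signal(congruent, incongruent_visual, "yellow", "blue"))  # True
print(uses_visual_signal(congruent, incongruent_stereo, "yellow", "blue"))  # False
```

A model that "uses the visual signal" in this sense flips its answer when the image contradicts the stereotype; a stereotype-reliant model keeps the stereotypical answer regardless of the image.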