We explore the extent to which zero-shot vision-language models exhibit gender bias across different vision tasks. Vision models traditionally required task-specific labels and finetuning to represent concepts; zero-shot models such as CLIP instead perform tasks with an open vocabulary, using text embeddings to represent concepts rather than a fixed set of labels. With these capabilities in mind, we ask: Do vision-language models exhibit gender bias when performing zero-shot image classification, object detection, and semantic segmentation? We evaluate several vision-language models on multiple datasets across a set of concepts and find that (i) every model evaluated shows distinct performance differences depending on the perceived gender of the person co-occurring with a given concept in the image, and aggregating analyses over all concepts can mask these disparities; (ii) model calibration (i.e., the relationship between accuracy and confidence) also differs distinctly by perceived gender, even when evaluating similar representations of a concept; and (iii) the observed disparities align with existing gender biases in word embeddings from language models. These findings suggest that, while language greatly expands the capabilities of vision models, it can also introduce social biases into zero-shot vision settings. Moreover, these biases can propagate when foundation models like CLIP are used by other models to enable zero-shot capabilities.
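To make the zero-shot setup concrete, below is a minimal sketch of open-vocabulary image classification with a pretrained CLIP model via the Hugging Face transformers API. The checkpoint name, prompts, and image path are illustrative assumptions, not the exact configuration evaluated in this work.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pretrained CLIP checkpoint (example checkpoint, not necessarily
# the one studied in the paper).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Open-vocabulary classification: the label set is just a list of text
# prompts, so no task-specific finetuning is required.
labels = ["a photo of a doctor", "a photo of a nurse", "a photo of a teacher"]
image = Image.open("example.jpg")  # hypothetical input image

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax over the
# candidate prompts gives the zero-shot prediction probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

Because the label set is defined entirely by text prompts, any gender associations encoded in the text embeddings can directly influence the zero-shot predictions, which is the mechanism this work probes.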