Vision-Language Pre-training (VLP) models such as CLIP have gained popularity in recent years. However, many works have found that the social biases hidden in CLIP easily manifest in downstream tasks, especially image retrieval, with potentially harmful effects on society. In this work, we propose FairCLIP to eliminate the social bias in CLIP-based image retrieval without damaging retrieval performance, achieving compatibility between the debiasing effect and retrieval performance. FairCLIP consists of two steps: Attribute Prototype Learning (APL) and Representation Neutralization (RN). In the first step, we extract from CLIP the concepts needed for debiasing, using queries with learnable word-vector prefixes as the extraction structure. In the second step, we first divide attributes into target attributes and bias attributes. Our analysis shows that both kinds of attributes contribute to the bias, so we eliminate it by using a Re-Representation Matrix (RRM) to neutralize the representation. We compare FairCLIP with other methods on both debiasing effect and retrieval performance, and experiments demonstrate that it achieves the best compatibility between the two. Although FairCLIP is applied here to image retrieval, the representation neutralization it performs is common to all CLIP downstream tasks, which means FairCLIP can be applied as a general debiasing method for other fairness issues related to CLIP.
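To make the two steps concrete, the following is a minimal PyTorch sketch of one plausible reading of them, assuming CoOp-style learnable prefix vectors for APL and a learnable linear map for the RRM. The class and function names (AttributePrompt, ReRepresentation, neutralization_loss) and the exact loss form are hypothetical illustrations, not the paper's published implementation.

```python
# Illustrative sketch only: all names and the loss design below are
# assumptions, not code released with FairCLIP.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttributePrompt(nn.Module):
    """Step 1 (APL): a query with learnable word-vector prefixes.

    The prefix vectors are prepended to the embedded tokens of an
    attribute word (e.g. "doctor" or "male") and passed through a
    frozen CLIP text encoder to extract an attribute prototype.
    """

    def __init__(self, n_prefix: int, embed_dim: int):
        super().__init__()
        # Learnable prefix, analogous to CoOp-style prompt tuning.
        self.prefix = nn.Parameter(torch.randn(n_prefix, embed_dim) * 0.02)

    def forward(self, attr_token_embeds: torch.Tensor) -> torch.Tensor:
        # attr_token_embeds: (n_tokens, embed_dim) embeddings of the attribute word(s).
        return torch.cat([self.prefix, attr_token_embeds], dim=0)


class ReRepresentation(nn.Module):
    """Step 2 (RN): a Re-Representation Matrix applied to features.

    The matrix re-maps representations so that their similarity to bias
    prototypes is neutralized while similarity to target prototypes is kept.
    """

    def __init__(self, feat_dim: int):
        super().__init__()
        # Initialize near the identity so retrieval performance starts intact.
        self.rrm = nn.Parameter(torch.eye(feat_dim))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return F.normalize(feats @ self.rrm, dim=-1)


def neutralization_loss(img_feats, bias_protos, target_protos, target_sims_orig):
    """Hypothetical objective: push similarity to each bias prototype
    toward the per-image mean (equal affinity to every bias group) while
    keeping the original target similarities, preserving retrieval."""
    bias_sim = img_feats @ bias_protos.t()         # (B, n_bias)
    neutral = bias_sim.mean(dim=1, keepdim=True)   # neutral point per image
    debias = ((bias_sim - neutral) ** 2).mean()
    keep = ((img_feats @ target_protos.t() - target_sims_orig) ** 2).mean()
    return debias + keep
```

Under these assumptions, only the prefix vectors and the RRM are trained while CLIP itself stays frozen, which is what lets the debiasing step leave the pre-trained retrieval behavior largely intact.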