We evaluate the state-of-the-art multimodal "visual semantic" model CLIP ("Contrastive Language Image Pretraining") for biases related to the marking of age, gender, and race or ethnicity. Given the option to label an image as "a photo of a person" or to select a label denoting race or ethnicity, CLIP chooses the "person" label 47.9% of the time for White individuals, compared with 5.0% or less for individuals who are Black, East Asian, Southeast Asian, Indian, or Latino or Hispanic. The model is more likely to rank the unmarked "person" label higher than labels denoting gender for Male individuals (26.7% of the time) vs. Female individuals (15.2% of the time). Age affects whether an individual is marked by the model: Female individuals under the age of 20 are more likely than Male individuals to be marked with a gender label, but less likely to be marked with an age label, while Female individuals over the age of 40 are more likely to be marked based on age than Male individuals. We also examine the self-similarity (mean pairwise cosine similarity) for each social group, where higher self-similarity denotes greater attention directed by CLIP to the shared characteristics (age, race, or gender) of the social group. As age increases, the self-similarity of representations of Female individuals increases at a higher rate than for Male individuals, with the disparity most pronounced at the "more than 70" age range. All ten of the most self-similar social groups are individuals under the age of 10 or over the age of 70, and six of the ten are Female individuals. Existing biases of self-similarity and markedness between Male and Female gender groups are further exacerbated when the groups compared are individuals who are White and Male and individuals who are Black and Female. Results indicate that CLIP reflects the biases of the language and society which produced its training data.
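To make the two measurements concrete, the following is a minimal sketch, not the authors' exact pipeline: the prompt wordings, the CLIP checkpoint, and the top-1 ranking rule are illustrative assumptions based only on the measures named above (whether the unmarked "person" label outranks marked race/ethnicity labels, and self-similarity as the mean pairwise cosine similarity of a group's image embeddings).

```python
# Illustrative sketch only; prompts, checkpoint, and ranking rule are assumptions,
# not the paper's exact experimental setup.
import itertools
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate labels: the unmarked "person" prompt plus marked race/ethnicity prompts.
PROMPTS = [
    "a photo of a person",
    "a photo of a White person",
    "a photo of a Black person",
    "a photo of an East Asian person",
    "a photo of a Southeast Asian person",
    "a photo of an Indian person",
    "a photo of a Latino or Hispanic person",
]

def prefers_unmarked_label(image: Image.Image) -> bool:
    """True if CLIP ranks the unmarked 'person' prompt above every marked prompt."""
    inputs = processor(text=PROMPTS, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image[0]  # one image-text score per prompt
    return int(torch.argmax(logits)) == 0  # index 0 is the unmarked "person" prompt

def self_similarity(images: list[Image.Image]) -> float:
    """Mean pairwise cosine similarity of CLIP image embeddings for one social group."""
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(pixel_values=inputs["pixel_values"])
    feats = feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize embeddings
    pairs = itertools.combinations(range(len(images)), 2)
    sims = [float(feats[i] @ feats[j]) for i, j in pairs]
    return sum(sims) / len(sims)
```

Aggregating `prefers_unmarked_label` over the images of each social group would yield group-level rates of the kind reported above, and `self_similarity` computed per group would yield the values compared across age and gender; the specific numbers in the abstract come from the paper's own evaluation, not from this sketch.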