We examine the state-of-the-art multimodal "visual semantic" model CLIP ("Contrastive Language Image Pretraining") for the rule of hypodescent, or one-drop rule, whereby multiracial people are more likely to be assigned a racial or ethnic label corresponding to a minority or disadvantaged racial or ethnic group than to the equivalent majority or advantaged group. A face morphing experiment grounded in psychological research demonstrating hypodescent indicates that, at the midway point of 1,000 series of morphed images, CLIP associates 69.7% of Black-White female images with a Black text label over a White text label, and similarly prefers Latina (75.8%) and Asian (89.1%) text labels at the midway point for Latina-White female and Asian-White female morphs, reflecting hypodescent. Additionally, assessment of the underlying cosine similarities in the model reveals that association with White is correlated with association with "person," with Pearson's rho as high as 0.82 over a 21,000-image morph series, indicating that a White person corresponds to the default representation of a person in CLIP. Finally, we show that the stereotype-congruent pleasantness association of an image correlates with association with the Black text label in CLIP, with Pearson's rho = 0.48 for 21,000 Black-White multiracial male images, and rho = 0.41 for Black-White multiracial female images. CLIP is trained on English-language text gathered using data collected from an American website (Wikipedia), and our findings demonstrate that CLIP embeds the values of American racial hierarchy, reflecting the implicit and explicit beliefs that are present in human minds. We contextualize these findings within the history and psychology of hypodescent. Overall, the data suggests that AI supervised using natural language will, unless checked, learn biases that reflect racial hierarchies.
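The two measurements summarized above, the share of midway morphs that are closer to one text label than to another and the Pearson correlation between two sets of cosine similarities, can be illustrated with a minimal sketch. It assumes image and text embeddings have already been extracted from the model. The arrays below are random placeholders, not CLIP outputs, and the helper names (`cosine_similarity`, `preference_rate`, `pearson`) are hypothetical, so the printed values carry no empirical meaning.

```python
import numpy as np


def cosine_similarity(images, text):
    """Cosine similarity between each row of `images` and a single `text` vector."""
    images = images / np.linalg.norm(images, axis=1, keepdims=True)
    text = text / np.linalg.norm(text)
    return images @ text


def preference_rate(images, label_a, label_b):
    """Fraction of images that are more similar to label_a than to label_b."""
    sim_a = cosine_similarity(images, label_a)
    sim_b = cosine_similarity(images, label_b)
    return float(np.mean(sim_a > sim_b))


def pearson(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


# Placeholder data (assumption): random vectors standing in for embeddings
# of midway morph images and of three text labels. Not real CLIP outputs.
rng = np.random.default_rng(0)
dim = 8
midway_images = rng.normal(size=(1000, dim))
text_minority = rng.normal(size=dim)
text_white = rng.normal(size=dim)
text_person = rng.normal(size=dim)

# Share of midway images closer to the minority label than to the White label.
rate = preference_rate(midway_images, text_minority, text_white)

# Correlation between association with "White" and association with "person".
r = pearson(
    cosine_similarity(midway_images, text_white),
    cosine_similarity(midway_images, text_person),
)

print(f"preference rate: {rate:.3f}")
print(f"Pearson correlation: {r:.3f}")
```

With real embeddings in place of the placeholders, `preference_rate` would correspond to the midway-point percentages and `pearson` to the reported correlations, under the assumption that each label is scored by a single cosine similarity per image.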