Models for text-to-image synthesis, such as DALL-E~2 and Stable Diffusion, have recently attracted a lot of interest from academia and the general public. When conditioned on textual descriptions, these models are capable of producing high-quality images depicting a wide variety of concepts and styles. However, these models adopt cultural characteristics associated with specific Unicode scripts from their vast amount of training data, which may not be immediately apparent. We show that by simply inserting a single non-Latin character into a textual description, common models reflect cultural stereotypes and biases in their generated images. We analyze this behavior both qualitatively and quantitatively, and identify a model's text encoder as the root cause of the phenomenon. Additionally, malicious users or service providers may try to intentionally bias the image generation to create racist stereotypes by replacing Latin characters with similar-looking characters from non-Latin scripts, so-called homoglyphs. To mitigate such unnoticed script attacks, we propose a novel homoglyph unlearning method to fine-tune a text encoder, making it robust against homoglyph manipulations.
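To make the homoglyph manipulation concrete, the following minimal Python sketch (our own illustration, not the paper's tooling) shows how a single Latin character in a prompt can be swapped for a visually near-identical character from another Unicode script; the mapping and function names are illustrative assumptions.

```python
# Minimal illustration of a homoglyph substitution in a text prompt.
# The mapping below is a small, hand-picked example; it is not the
# paper's attack implementation.

HOMOGLYPHS = {
    "o": "\u043e",  # Cyrillic small letter o
    "a": "\u0430",  # Cyrillic small letter a
    "A": "\u0391",  # Greek capital letter Alpha
}

def inject_homoglyph(prompt: str, target: str) -> str:
    """Replace the first occurrence of `target` with a look-alike character."""
    glyph = HOMOGLYPHS.get(target)
    if glyph is None:
        return prompt
    return prompt.replace(target, glyph, 1)

original = "A photo of a city"
manipulated = inject_homoglyph(original, "o")
print(original, manipulated)    # visually near-identical strings
print(original == manipulated)  # False: the underlying code points differ
```

Although the two prompts look identical to a human reader, the text encoder tokenizes them differently, which is what allows the inserted script to steer the generated image toward script-specific cultural associations.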