Text-guided image generation models, such as DALL-E 2 and Stable Diffusion, have recently received much attention from academia and the general public. Provided with textual descriptions, these models are capable of generating high-quality images depicting various concepts and styles. However, such models are trained on large amounts of public data and implicitly learn relationships from their training data that are not immediately apparent. We demonstrate that common multimodal models implicitly learned cultural biases that can be triggered and injected into the generated images by simply replacing single characters in the textual description with visually similar non-Latin characters. These so-called homoglyph replacements enable malicious users or service providers to induce biases into the generated images and even render the whole generation process useless. We practically illustrate such attacks on DALL-E 2 and Stable Diffusion as text-guided image generation models and further show that CLIP also behaves similarly. Our results further indicate that text encoders trained on multilingual data provide a way to mitigate the effects of homoglyph replacements.
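To illustrate the mechanism, the following Python sketch shows how a single Latin character in a prompt can be swapped for a visually similar non-Latin homoglyph. The prompt and the specific character mapping (Latin "o" to Cyrillic "о") are illustrative assumptions, not the exact configuration evaluated in the paper.

```python
# Minimal sketch of a homoglyph replacement attack on a text prompt.
# The prompt and mapping below are hypothetical examples for illustration.

prompt = "A photo of an actress"

# Map a Latin character to a visually similar non-Latin homoglyph
# (here: Latin 'o' U+006F -> Cyrillic 'о' U+043E).
HOMOGLYPHS = {"o": "\u043e"}


def inject_homoglyph(text: str, target: str, replacement: str, count: int = 1) -> str:
    """Replace up to `count` occurrences of `target` with a look-alike character."""
    return text.replace(target, replacement, count)


poisoned_prompt = inject_homoglyph(prompt, "o", HOMOGLYPHS["o"])

print(prompt)                     # Both lines render visually identically...
print(poisoned_prompt)
print(prompt == poisoned_prompt)  # ...but this prints False: the byte sequences differ,
                                  # which is what can steer a text encoder toward
                                  # culturally biased image generations.
```

Although the two prompts are indistinguishable to a human reader, the text encoder tokenizes them differently, which is the entry point for the bias injection described above.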