Recent work has shown that despite their impressive capabilities, text-to-image diffusion models such as DALL-E 2 (Ramesh et al., 2022) can display strange behaviours when a prompt contains a word with multiple possible meanings, often generating images containing both senses of the word (Rassin et al., 2022). In this work we seek to put forward a possible explanation of this phenomenon. Using the similar Stable Diffusion model (Rombach et al., 2022), we first show that when given an input that is the sum of encodings of two distinct words, the model can produce an image containing both concepts represented in the sum. We then demonstrate that the CLIP encoder used to encode prompts (Radford et al., 2021) encodes polysemous words as a superposition of meanings, and that using linear algebraic techniques we can edit these representations to influence the senses represented in the generated images. Combining these two findings, we suggest that the homonym duplication phenomenon described by Rassin et al. (2022) is caused by diffusion models producing images representing both of the meanings that are present in superposition in the encoding of a polysemous word.
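As a toy illustration of the linear-algebraic idea described above, the following sketch builds a "superposition" of two sense vectors and then removes one sense by orthogonal projection. It is only a sketch under stated assumptions: the vectors are random stand-ins rather than real CLIP encodings, the names `sense_a`, `sense_b`, and `remove_direction` are hypothetical, and projection is one plausible editing technique, not necessarily the one used in this work.

```python
import numpy as np


def unit(v):
    """Return v scaled to unit Euclidean length."""
    return v / np.linalg.norm(v)


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(unit(u), unit(v)))


def remove_direction(x, d):
    """Project x onto the subspace orthogonal to direction d."""
    d = unit(d)
    return x - np.dot(x, d) * d


rng = np.random.default_rng(0)
dim = 512  # assumed embedding width, chosen only for illustration

# Random stand-ins for the encodings of two distinct word senses
# (hypothetical; a real experiment would take these from the text encoder).
sense_a = unit(rng.standard_normal(dim))
sense_b = unit(rng.standard_normal(dim))

# Model a polysemous word's encoding as the sum of both senses.
polysemous = sense_a + sense_b

# Edit the representation: suppress sense A, leaving mostly sense B.
edited = remove_direction(polysemous, sense_a)

print("before edit: cos to A = %.3f, cos to B = %.3f"
      % (cosine(polysemous, sense_a), cosine(polysemous, sense_b)))
print("after edit:  cos to A = %.3f, cos to B = %.3f"
      % (cosine(edited, sense_a), cosine(edited, sense_b)))
```

In this toy setting the summed vector stays similar to both senses, while the edited vector is orthogonal to the removed sense and moves closer to the remaining one. Whether real prompt encodings behave this cleanly is an empirical question that the sketch does not answer.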