Large-scale diffusion neural networks represent a substantial milestone in text-to-image generation, but they remain poorly understood, lacking interpretability analyses. In this paper, we perform a text-image attribution analysis on Stable Diffusion, a recently open-sourced model. To produce pixel-level attribution maps, we upscale and aggregate cross-attention word-pixel scores in the denoising subnetwork, naming our method DAAM. We evaluate its correctness by testing its semantic segmentation ability on nouns, as well as its generalized attribution quality on all parts of speech, rated by humans. We then apply DAAM to study the role of syntax in the pixel space, characterizing head--dependent heat map interaction patterns for ten common dependency relations. Finally, we study several semantic phenomena using DAAM, with a focus on feature entanglement, where we find that cohyponyms worsen generation quality and descriptive adjectives attend too broadly. To our knowledge, we are the first to interpret large diffusion models from a visuolinguistic perspective, which enables future lines of research. Our code is at https://github.com/castorini/daam.
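To make the aggregation step concrete, below is a minimal sketch (not the authors' exact implementation) of the core idea behind DAAM: bilinearly upscale the low-resolution cross-attention score maps collected from the denoising subnetwork and sum them into a single per-word heat map. The function name, tensor layout, and output size are illustrative assumptions.

```python
# Hedged sketch of DAAM-style heat-map aggregation; tensor shapes are assumed.
import torch
import torch.nn.functional as F


def aggregate_heat_map(attn_maps, token_idx, out_size=512):
    """attn_maps: list of cross-attention tensors, each of shape
    (heads, height, width, num_text_tokens), gathered over layers and timesteps.
    token_idx: index of the prompt token whose heat map we want."""
    heat = torch.zeros(out_size, out_size)
    for a in attn_maps:
        # Take the scores attending to this token and average over attention heads.
        m = a[..., token_idx].mean(dim=0)  # (height, width)
        # Upscale the low-resolution map to the output image resolution.
        m = F.interpolate(m[None, None], size=(out_size, out_size),
                          mode="bilinear", align_corners=False)[0, 0]
        heat += m
    # Normalize to [0, 1] for visualization.
    return heat / heat.max()
```

In practice, one such heat map per prompt word can then be thresholded for segmentation-style evaluation or overlaid on the generated image for qualitative inspection.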