Text-to-image diffusion models often make implicit assumptions about the world when generating images. While some assumptions are useful (e.g., the sky is blue), they can also be outdated, incorrect, or reflective of social biases present in the training data. Thus, there is a need to control these assumptions without requiring explicit user input or costly re-training. In this work, we aim to edit a given implicit assumption in a pre-trained diffusion model. Our Text-to-Image Model Editing method, TIME for short, receives a pair of inputs: a "source" under-specified prompt for which the model makes an implicit assumption (e.g., "a pack of roses"), and a "destination" prompt that describes the same setting, but with a specified desired attribute (e.g., "a pack of blue roses"). TIME then updates the model's cross-attention layers, as these layers assign visual meaning to textual tokens. We edit the projection matrices in these layers such that the source prompt is projected close to the destination prompt. Our method is highly efficient, as it modifies a mere 2.2% of the model's parameters in under one second. To evaluate model editing approaches, we introduce TIMED (TIME Dataset), containing 147 source and destination prompt pairs from various domains. Our experiments (using Stable Diffusion) show that TIME is successful in model editing, generalizes well for related prompts unseen during editing, and imposes minimal effect on unrelated generations.
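To make the editing step concrete, below is a minimal sketch of the closed-form update the abstract alludes to: re-solving one cross-attention projection matrix (a key or value projection) so that the source prompt's token embeddings map to where the destination prompt's tokens used to land, with a Frobenius-norm penalty keeping the new weights close to the originals. The function name, the token-alignment assumption, and the default regularization strength `lam` are illustrative, not taken from the paper.

```python
import torch

def time_edit_projection(W: torch.Tensor,
                         c_src: torch.Tensor,
                         c_dst: torch.Tensor,
                         lam: float = 0.1) -> torch.Tensor:
    """Closed-form edit of one cross-attention projection matrix.

    Minimizes  sum_i ||W' c_src_i - W c_dst_i||^2 + lam * ||W' - W||_F^2,
    so source tokens are projected near the destination tokens' outputs
    while the edited weights stay close to the original ones.

    W     : (d_out, d_in) original key or value projection weight
    c_src : (n, d_in) text-encoder embeddings of the source prompt tokens
    c_dst : (n, d_in) embeddings of the aligned destination prompt tokens
    lam   : regularization strength (hypothetical default)
    """
    d_in = W.shape[1]
    # Right-hand side: lam * W + sum_i (W c_dst_i) c_src_i^T
    rhs = lam * W + (W @ c_dst.T) @ c_src                      # (d_out, d_in)
    # Gram matrix: lam * I + sum_i c_src_i c_src_i^T
    gram = lam * torch.eye(d_in, device=W.device, dtype=W.dtype) + c_src.T @ c_src
    # Stationary point of the regularized least-squares objective
    return rhs @ torch.linalg.inv(gram)
```

In a full pipeline, an update like this would be applied to both the key and value projections of every cross-attention layer in the denoising network (in Stable Diffusion, the linear layers that project the text encoder's output), which is consistent with the abstract's figure of editing only a small fraction of the parameters in a single closed-form pass.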