ReVersion: Diffusion-Based Relation Inversion from Images

From arXiv. The first two authors contributed equally. Project page: https://ziqihuangg.github.io/projects/reversion.html Code: https://github.com/ziqihuangg/ReVersion

Diffusion models are gaining popularity for their generative capabilities. Recently, there has been a surging need to generate customized images by inverting diffusion models from exemplar images. However, existing inversion methods mainly focus on capturing object appearances. How to invert object relations, another important pillar of the visual world, remains unexplored. In this work, we propose ReVersion for the Relation Inversion task, which aims to learn a specific relation (represented as a "relation prompt") from exemplar images. Specifically, we learn a relation prompt from a frozen pre-trained text-to-image diffusion model. The learned relation prompt can then be applied to generate relation-specific images with new objects, backgrounds, and styles. Our key insight is the "preposition prior": real-world relation prompts can be sparsely activated upon a set of basis prepositional words. Building on this insight, we propose a novel relation-steering contrastive learning scheme to impose two critical properties on the relation prompt: 1) the relation prompt should capture the interaction between objects, enforced by the preposition prior; 2) the relation prompt should be disentangled from object appearances. We further devise relation-focal importance sampling to emphasize high-level interactions over low-level appearances (e.g., texture, color). To comprehensively evaluate this new task, we contribute the ReVersion Benchmark, which provides various exemplar images with diverse relations. Extensive experiments validate the superiority of our approach over existing methods across a wide range of visual relations.
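
The two training ingredients named in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation. It assumes (a) that the relation-steering objective takes an InfoNCE-style form with preposition embeddings as positives and other word embeddings as negatives, and (b) that relation-focal importance sampling means drawing noisier (larger) diffusion timesteps more often, using a cosine-shaped weighting. The function names, the `temperature` and `alpha` parameters, and the toy 2-D embeddings are all hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def steering_loss(relation, positives, negatives, temperature=0.07):
    """InfoNCE-style loss (assumed form, not taken from the source).

    Low when `relation` is close to the preposition embeddings
    (positives) and far from the other word embeddings (negatives).
    """
    pos = sum(math.exp(cosine(relation, p) / temperature) for p in positives)
    neg = sum(math.exp(cosine(relation, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


def timestep_weights(num_steps, alpha=0.5):
    """Sampling probabilities that grow with the timestep index.

    The cosine shape and `alpha` are assumptions for illustration;
    alpha=0 recovers uniform sampling.
    """
    raw = [1.0 - alpha * math.cos(math.pi * t / num_steps)
           for t in range(1, num_steps + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def sample_timestep(num_steps, alpha=0.5, rng=random):
    """Draw one timestep in 1..num_steps according to timestep_weights."""
    weights = timestep_weights(num_steps, alpha)
    return rng.choices(range(1, num_steps + 1), weights=weights, k=1)[0]


# Toy 2-D embeddings standing in for real text-encoder embeddings.
prepositions = [[1.0, 0.0], [0.9, 0.1]]   # hypothetical "on", "inside"
other_words = [[0.0, 1.0], [-1.0, 0.0]]   # hypothetical nouns/adjectives

loss_near_prepositions = steering_loss([1.0, 0.05], prepositions, other_words)
loss_near_other_words = steering_loss([0.05, 1.0], prepositions, other_words)
```

In this toy setup `loss_near_prepositions` is smaller than `loss_near_other_words`, so minimizing the loss would steer the relation embedding toward the preposition region, and `timestep_weights` gives the last timestep more probability mass than the first.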
