We introduce an Extended Textual Conditioning space for text-to-image models, referred to as $P+$. This space consists of multiple textual conditions derived from per-layer prompts, each corresponding to one layer of the diffusion model's denoising U-Net. We show that the extended space provides greater disentanglement and control over image synthesis. We further introduce Extended Textual Inversion (XTI), in which images are inverted into $P+$ and represented by per-layer tokens. We show that XTI is more expressive and precise, and converges faster, than the original Textual Inversion (TI). The extended inversion method does not involve any noticeable trade-off between reconstruction and editability, and it induces more regular inversions. We conduct a series of extensive experiments to analyze and understand the properties of the new space, and to showcase the effectiveness of our method for personalizing text-to-image models. Furthermore, we utilize the unique properties of this space to achieve previously unattainable results in object-style mixing using text-to-image models. Project page: https://prompt-plus.github.io
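To make the idea of per-layer conditioning concrete, the following is a minimal sketch and not the authors' implementation. It represents each layer's condition as a plain prompt string. The function names (`p_conditioning`, `p_plus_conditioning`, `mix_layers`), the layer count, and the choice of which layers to swap are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Set


def p_conditioning(prompt: str, num_layers: int) -> List[str]:
    """Standard conditioning: one prompt is shared by every layer."""
    return [prompt] * num_layers


def p_plus_conditioning(per_layer_prompts: Sequence[str], num_layers: int) -> List[str]:
    """Extended conditioning: each layer receives its own prompt."""
    if len(per_layer_prompts) != num_layers:
        raise ValueError("expected exactly one prompt per layer")
    return list(per_layer_prompts)


def mix_layers(base: Sequence[str], other: Sequence[str],
               layers_from_other: Set[int]) -> List[str]:
    """Combine two per-layer conditions by taking selected layers from `other`.

    Which layers to swap is an assumption of this sketch; the abstract does
    not say how layers are assigned in object-style mixing.
    """
    if len(base) != len(other):
        raise ValueError("both conditions must cover the same number of layers")
    return [other[i] if i in layers_from_other else base[i]
            for i in range(len(base))]


if __name__ == "__main__":
    num_layers = 4  # hypothetical; a real U-Net has more conditioned layers
    shared = p_conditioning("a red cube", num_layers)
    extended = p_plus_conditioning(
        ["a red cube", "a red cube", "a green lizard", "a green lizard"],
        num_layers,
    )
    mixed = mix_layers(shared, extended, layers_from_other={2, 3})
    for idx, cond in enumerate(mixed):
        print(f"layer {idx}: {cond}")
```

In an actual model the strings would be replaced by text-encoder embeddings, and under XTI by one learned token embedding per layer, but the bookkeeping of "one condition per layer" stays the same.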