Diffusion models have shown superior performance in image generation and manipulation, but their inherent stochasticity presents challenges in preserving and manipulating image content and identity. While previous approaches such as DreamBooth and Textual Inversion have proposed model or latent representation personalization to maintain content, their reliance on multiple reference images and complex training limits their practicality. In this paper, we present a simple yet highly effective approach to personalization using a highly personalized (HiPer) text embedding, obtained by decomposing the CLIP embedding space into parts responsible for personalization and for content manipulation. Our method requires neither model fine-tuning nor identifiers, yet still enables manipulation of background, texture, and motion with just a single image and a target text. Through experiments on diverse target texts, we demonstrate that our approach produces highly personalized and complex semantic image edits across a wide range of tasks. We believe that the novel understanding of the text embedding space presented in this work has the potential to inspire further research across various tasks.
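The sketch below illustrates, under stated assumptions, how a personalized text embedding of the kind described above could be optimized from a single image: the trailing tokens of the source prompt's CLIP embedding are treated as trainable parameters and fitted with a standard diffusion denoising loss while the generative model stays frozen. This is not the authors' released implementation; the `unet` callable, the token split `n_personal_tokens`, and the linear noise schedule are assumptions made for illustration.

```python
# Minimal sketch (assumed, not the official HiPer code): optimize only the
# trailing tokens of a prompt embedding to reconstruct one image, with the
# diffusion model itself kept frozen.

import torch
import torch.nn.functional as F


def optimize_hiper_embedding(unet, source_latent, source_text_emb,
                             n_personal_tokens=5, steps=1000, lr=5e-3,
                             num_train_timesteps=1000):
    # `unet` is an assumed callable: (noisy_latent, timestep, text_emb) -> predicted noise.
    # Linear DDPM noise schedule (cumulative alpha terms) for the forward process.
    betas = torch.linspace(1e-4, 0.02, num_train_timesteps)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0).to(source_latent.device)

    # Decompose the prompt embedding: leading tokens stay fixed, the last
    # `n_personal_tokens` become the trainable, highly personalized part.
    seq_len = source_text_emb.shape[1]
    fixed = source_text_emb[:, :seq_len - n_personal_tokens].detach()
    personal = (source_text_emb[:, seq_len - n_personal_tokens:]
                .detach().clone().requires_grad_(True))
    optimizer = torch.optim.Adam([personal], lr=lr)

    for _ in range(steps):
        # Sample a random timestep and apply forward diffusion to the source latent.
        t = torch.randint(0, num_train_timesteps, (1,), device=source_latent.device)
        noise = torch.randn_like(source_latent)
        a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
        noisy = a_bar.sqrt() * source_latent + (1.0 - a_bar).sqrt() * noise

        # Recompose the prompt embedding and ask the frozen denoiser to recover the noise.
        cond = torch.cat([fixed, personal], dim=1)
        pred = unet(noisy, t, cond)
        loss = F.mse_loss(pred, noise)  # single-image reconstruction objective

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # At edit time, the optimized tokens would be appended to the target
    # prompt's embedding so the edit preserves the subject while following
    # the target text.
    return personal.detach()
```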