Diffusion models have shown superior performance in image generation and manipulation, but their inherent stochasticity makes it challenging to preserve and manipulate image content and identity. While previous approaches such as DreamBooth and Textual Inversion personalize the model or latent representations to preserve content, their reliance on multiple reference images and complex training limits their practicality. In this paper, we present a simple yet highly effective approach to personalization using a highly personalized (HiPer) text embedding, obtained by decomposing the CLIP embedding space into a personalization part and a content-manipulation part. Our method requires neither model fine-tuning nor identifier tokens, yet enables manipulation of background, texture, and motion from just a single image and a target text. Through experiments on diverse target texts, we demonstrate that our approach produces highly personalized and semantically complex image edits across a wide range of tasks. We believe that the novel understanding of the text embedding space presented in this work has the potential to inspire further research across various tasks.
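To make the core idea concrete, below is a minimal sketch (not the authors' released code) of how a personalized text-embedding suffix could be optimized against a single reference image with a frozen Stable Diffusion model, using the Hugging Face diffusers and transformers APIs. The model ID, prompt, number of personalized tokens (`n_hiper`), and learning rate are illustrative assumptions; only the text-embedding suffix is trainable while the U-Net, VAE, and text encoder stay frozen.

```python
# Sketch: optimize a personalized (HiPer-style) text-embedding suffix
# against one image; all model weights are frozen, hyperparameters are assumed.
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

device = "cuda"
model_id = "runwayml/stable-diffusion-v1-5"  # assumed base model

tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder").to(device)
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae").to(device)
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet").to(device)
scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")
for m in (text_encoder, vae, unet):
    m.requires_grad_(False)

# Source prompt describing the single image; its CLIP embedding is split into
# a frozen "content" prefix and a trainable personalized suffix of n_hiper tokens.
prompt = "a photo of a dog"  # assumed source prompt
n_hiper = 5                  # assumed number of personalized tokens
ids = tokenizer(prompt, padding="max_length",
                max_length=tokenizer.model_max_length,
                truncation=True, return_tensors="pt").input_ids.to(device)
with torch.no_grad():
    src_emb = text_encoder(ids)[0]            # (1, 77, 768) CLIP text embedding
hiper = src_emb[:, -n_hiper:].clone().requires_grad_(True)  # trainable suffix
optimizer = torch.optim.Adam([hiper], lr=5e-3)

def train_step(image_latents):
    """One optimization step; image_latents are the VAE latents of the
    single reference image, shape (1, 4, 64, 64) for a 512x512 image."""
    noise = torch.randn_like(image_latents)
    t = torch.randint(0, scheduler.config.num_train_timesteps, (1,), device=device)
    noisy = scheduler.add_noise(image_latents, noise, t)
    # Conditioning = frozen content embedding + trainable personalized suffix.
    cond = torch.cat([src_emb[:, :-n_hiper], hiper], dim=1)
    pred = unet(noisy, t, encoder_hidden_states=cond).sample
    loss = torch.nn.functional.mse_loss(pred, noise)  # standard denoising loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time, the same kind of suffix can be appended to the embedding of a target prompt (e.g. "a photo of a dog jumping"), which is how the abstract's single-image, fine-tuning-free manipulation of background, texture, and motion would be realized under these assumptions.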