The recently introduced Contrastive Language-Image Pre-Training (CLIP) model bridges images and text by embedding them into a joint latent space. This has opened the door to a growing body of work that aims to manipulate an input image guided by a textual description. However, due to the discrepancy between image and text embeddings in the joint space, using text embeddings directly as the optimization target often introduces undesired artifacts into the resulting images. Disentanglement, interpretability, and controllability of the manipulation are also hard to guarantee. To alleviate these problems, we propose defining corpus subspaces spanned by relevant prompts to capture specific image characteristics. We introduce the CLIP Projection-Augmentation Embedding (PAE) as an optimization target to improve the performance of text-guided image manipulation. Our method is a simple and general paradigm that can be easily computed and adapted, and smoothly incorporated into any CLIP-based image manipulation algorithm. To demonstrate the effectiveness of our method, we conduct several theoretical and empirical studies. As a case study, we apply the method to text-guided semantic face editing. We demonstrate quantitatively and qualitatively that PAE facilitates more disentangled, interpretable, and controllable image manipulation with state-of-the-art quality and accuracy.