Leveraging StyleGAN's expressivity and its disentangled latent codes, existing methods can achieve realistic editing of different visual attributes of facial images, such as age and gender. An intriguing yet challenging problem arises: can generative models achieve counterfactual editing against their learnt priors? Due to the lack of counterfactual samples in natural datasets, we investigate this problem in a text-driven manner with Contrastive Language-Image Pre-training (CLIP), which can offer rich semantic knowledge even for various counterfactual concepts. Unlike in-domain manipulation, counterfactual manipulation requires more comprehensive exploitation of the semantic knowledge encapsulated in CLIP, as well as more delicate handling of editing directions to avoid getting stuck in local minima or producing undesired edits. To this end, we design a novel contrastive loss that exploits predefined CLIP-space directions to guide the editing toward desired directions from different perspectives. In addition, we design a simple yet effective scheme that explicitly maps CLIP embeddings (of the target text) to the latent space and fuses them with latent codes for effective latent code optimization and accurate editing. Extensive experiments show that our design achieves accurate and realistic editing when driven by target texts with various counterfactual concepts.
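To make the idea of a direction-based contrastive loss concrete, the following is a minimal sketch only. The abstract does not give the loss formulation, so everything here is an assumption: the InfoNCE-style form, the temperature `tau`, and the hypothetical names `directional_contrastive_loss`, `pos_dir`, and `neg_dirs`. Plain Python lists stand in for CLIP embeddings; in practice they would come from a CLIP image and text encoder.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def directional_contrastive_loss(img_src, img_edit, pos_dir, neg_dirs, tau=0.1):
    """Assumed InfoNCE-style loss over CLIP-space directions.

    img_src, img_edit: embeddings of the source and edited images.
    pos_dir: predefined direction the edit should follow
             (e.g. target-text embedding minus neutral-text embedding).
    neg_dirs: predefined directions the edit should avoid.
    """
    # Direction in which the edit moved the image embedding.
    edit_dir = [e - s for e, s in zip(img_edit, img_src)]
    # The positive logit comes first, followed by one logit per negative.
    logits = [cosine(edit_dir, pos_dir) / tau]
    logits += [cosine(edit_dir, n) / tau for n in neg_dirs]
    # Numerically stable log-sum-exp for the softmax denominator.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    # Negative log-probability assigned to the positive direction.
    return log_denom - logits[0]

# Toy usage with 3-dimensional stand-ins for CLIP embeddings.
source = [0.0, 0.0, 0.0]
pos = [1.0, 0.0, 0.0]
negs = [[0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]
loss_aligned = directional_contrastive_loss(source, [1.0, 0.0, 0.0], pos, negs)
loss_off_target = directional_contrastive_loss(source, [0.0, 1.0, 0.0], pos, negs)
```

Under these assumptions, an edit that moves the image embedding along `pos_dir` gets a lower loss than one that moves along a negative direction, which is the behaviour a direction-guided objective would need.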