We tackle the problem of target-free text-guided image manipulation, which requires modifying an input reference image according to a given text instruction while no ground-truth target image is observed during training. To address this challenging task, we propose a Cyclic-Manipulation GAN (cManiGAN), which learns where and how to edit the image regions of interest. Specifically, the image editor of cManiGAN learns to identify the regions to edit and to complete the resulting image, while a cross-modal interpreter and a reasoner are deployed to verify the semantic correctness of the output image given the input instruction. The former utilizes factual/counterfactual description learning to authenticate the image semantics, while the latter predicts the "undo" instruction and provides pixel-level supervision for training cManiGAN. With such operational cycle-consistency, cManiGAN can be trained in the above weakly supervised setting. We conduct extensive experiments on the CLEVR and COCO datasets, which verify the effectiveness and generalizability of our proposed method. Project page: https://sites.google.com/view/wancyuanfan/projects/cmanigan.
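To make the operational cycle-consistency concrete, below is a minimal PyTorch sketch of the pixel-level "undo" supervision signal: an editor applies the instruction, a reasoner predicts the reverse instruction, and re-editing should recover the original image. The `TinyEditor` and `TinyReasoner` modules are hypothetical stand-ins, not the actual cManiGAN architecture, and the adversarial and factual/counterfactual interpreter losses are omitted.

```python
# Sketch of the operational cycle-consistency objective, assuming hypothetical
# editor/reasoner modules; the actual cManiGAN architecture differs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEditor(nn.Module):
    """Hypothetical stand-in for the image editor G(x, t) -> x'."""
    def __init__(self, txt_dim=32):
        super().__init__()
        self.net = nn.Conv2d(3 + txt_dim, 3, kernel_size=3, padding=1)

    def forward(self, img, txt):
        # Broadcast the instruction embedding over spatial locations.
        b, _, h, w = img.shape
        t = txt[:, :, None, None].expand(b, txt.size(1), h, w)
        return torch.tanh(self.net(torch.cat([img, t], dim=1)))

class TinyReasoner(nn.Module):
    """Hypothetical stand-in for the reasoner predicting the 'undo' instruction."""
    def __init__(self, txt_dim=32):
        super().__init__()
        self.fc = nn.Linear(txt_dim, txt_dim)

    def forward(self, txt):
        return self.fc(txt)

editor, reasoner = TinyEditor(), TinyReasoner()
opt = torch.optim.Adam(
    list(editor.parameters()) + list(reasoner.parameters()), lr=1e-4)

x = torch.rand(4, 3, 64, 64)     # reference images
t = torch.randn(4, 32)           # embedded "do" instructions

x_edit = editor(x, t)            # forward edit: where/how to modify x
t_undo = reasoner(t)             # predicted "undo" instruction
x_back = editor(x_edit, t_undo)  # applying the undo should recover x

# Pixel-level cycle-consistency loss: no ground-truth target image needed.
loss_cyc = F.l1_loss(x_back, x)
loss_cyc.backward()
opt.step()
```

The key design point this sketch illustrates is that the reconstruction target is the input image itself, which is what allows training without any ground-truth edited image.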