Multi-turn compositional image generation (M-CIG) is a challenging task that aims to iteratively manipulate a reference image according to a modification text. While most existing methods for M-CIG are based on generative adversarial networks (GANs), recent advances in image generation have demonstrated the superiority of diffusion models over GANs. In this paper, we propose a diffusion-based method for M-CIG named conditional denoising diffusion with image compositional matching (CDD-ICM). We leverage CLIP as the backbone of the image and text encoders, and incorporate a gated fusion mechanism, originally proposed for question answering, to compositionally fuse the reference image and the modification text at each turn of M-CIG. We introduce a conditioning scheme to generate the target image based on the fusion results. To prioritize the semantic quality of the generated target image, we learn an auxiliary image compositional matching (ICM) objective alongside the conditional denoising diffusion (CDD) objective in a multi-task learning framework. Additionally, we perform both ICM guidance and classifier-free guidance to improve performance. Experimental results show that CDD-ICM achieves state-of-the-art results on two benchmark datasets for M-CIG, i.e., CoDraw and i-CLEVR.