Diffusion models have recently emerged as a powerful generative tool. Despite great progress, existing diffusion models mainly focus on uni-modal control, i.e., the diffusion process is driven by only one modality of condition. To further unleash users' creativity, it is desirable for the model to be controllable by multiple modalities simultaneously, e.g., generating and editing faces by describing the age (text-driven) while drawing the face shape (mask-driven). In this work, we present Collaborative Diffusion, where pre-trained uni-modal diffusion models collaborate to achieve multi-modal face generation and editing without re-training. Our key insight is that diffusion models driven by different modalities are inherently complementary with regard to the latent denoising steps, upon which bilateral connections can be established. Specifically, we propose the dynamic diffuser, a meta-network that adaptively hallucinates multi-modal denoising steps by predicting the spatial-temporal influence functions for each pre-trained uni-modal model. Collaborative Diffusion not only combines the generation capabilities of uni-modal diffusion models, but also integrates multiple uni-modal manipulations to perform multi-modal editing. Extensive qualitative and quantitative experiments demonstrate the superiority of our framework in both image quality and condition consistency.
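To make the mechanism concrete, below is a minimal sketch of how predicted spatial-temporal influence functions could fuse the denoising predictions of several frozen uni-modal diffusion models. This is an illustrative assumption, not the authors' released implementation: the names `DynamicDiffuser`, `collaborative_denoise_step`, and the `model(x_t, t, cond)` call signature are hypothetical placeholders.

```python
import torch
import torch.nn as nn


class DynamicDiffuser(nn.Module):
    """Hypothetical sketch: a small meta-network that predicts a per-pixel,
    per-timestep influence map for one pre-trained uni-modal diffusion model."""

    def __init__(self, in_channels: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels + 1, hidden, 3, padding=1),  # +1 channel for the timestep
            nn.SiLU(),
            nn.Conv2d(hidden, 1, 3, padding=1),
        )

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Broadcast the timestep as an extra spatial channel, then predict raw influence logits.
        t_map = t.view(-1, 1, 1, 1).float().expand(-1, 1, *x_t.shape[2:])
        return self.net(torch.cat([x_t, t_map], dim=1))


def collaborative_denoise_step(x_t, t, unimodal_models, conditions, diffusers):
    """One denoising step that fuses noise predictions from several frozen
    uni-modal diffusion models, weighted by softmax-normalized influence maps.
    All argument interfaces here are assumed for illustration."""
    eps_preds, logits = [], []
    for model, cond, diffuser in zip(unimodal_models, conditions, diffusers):
        with torch.no_grad():                       # pre-trained uni-modal models stay frozen
            eps_preds.append(model(x_t, t, cond))   # each model's noise prediction
        logits.append(diffuser(x_t, t))             # spatial-temporal influence logits
    # Normalize influences across models so they sum to one at every pixel and timestep.
    weights = torch.softmax(torch.stack(logits, dim=0), dim=0)
    return (weights * torch.stack(eps_preds, dim=0)).sum(dim=0)
```

In this sketch only the dynamic diffusers would be trained, while the uni-modal diffusion models remain frozen, which mirrors the "without re-training" claim in the abstract.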