The advent of open-source AI communities has produced a cornucopia of powerful text-guided diffusion models trained on various datasets. Yet few explorations have been conducted into ensembling such models to combine their strengths. In this work, we propose a simple yet effective method called Saliency-aware Noise Blending (SNB) that empowers fused text-guided diffusion models to achieve more controllable generation. Specifically, we experimentally find that the responses of classifier-free guidance are highly related to the saliency of generated images. Thus, we propose to trust different models in their areas of expertise by blending the predicted noises of two diffusion models in a saliency-aware manner. SNB is training-free and can be completed within a single DDIM sampling process. Additionally, it automatically aligns the semantics of the two noise spaces without requiring additional annotations such as masks. Extensive experiments demonstrate the effectiveness of SNB across a variety of applications. The project page is available at https://magicfusion.github.io/.
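The core idea can be illustrated with a minimal sketch. The following is not the authors' implementation; it assumes a saliency signal derived from the magnitude of each model's classifier-free guidance response (the difference between its conditional and unconditional noise predictions), which is then used to blend the two models' predicted noises per pixel at each DDIM step. The function names and the soft-gating normalization are illustrative choices, not part of the paper.

```python
import numpy as np

def saliency_mask(eps_cond, eps_uncond, tau=0.1):
    # Hypothetical saliency: magnitude of the classifier-free guidance
    # response |eps_cond - eps_uncond|, averaged over channels and
    # squashed to (0, 1) with a smooth sigmoid gate.
    resp = np.abs(eps_cond - eps_uncond).mean(axis=0)          # (H, W)
    resp = (resp - resp.min()) / (resp.max() - resp.min() + 1e-8)
    return 1.0 / (1.0 + np.exp(-(resp - 0.5) / tau))           # (H, W)

def blend_noise(eps_a, eps_b, mask):
    # Trust model A where its saliency is high, model B elsewhere.
    # eps_a, eps_b: predicted noises of shape (C, H, W); mask: (H, W).
    return mask[None] * eps_a + (1.0 - mask[None]) * eps_b
```

In a full sampler, the blended noise would replace the single-model prediction inside each DDIM update, so the fusion adds no training and no extra sampling passes.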