Diffusion probabilistic models (DPMs) have become a popular approach to conditional generation, due to their promising results and support for cross-modal synthesis. A key desideratum in conditional synthesis is to achieve high correspondence between the conditioning input and generated output. Most existing methods learn such relationships implicitly, by incorporating the prior into the variational lower bound. In this work, we take a different route -- we explicitly enhance the input-output connection by maximizing the mutual information between the two. To this end, we introduce a Conditional Discrete Contrastive Diffusion (CDCD) loss and design two contrastive diffusion mechanisms to effectively incorporate it into the denoising process, combining diffusion training and contrastive learning for the first time by connecting the contrastive loss with the conventional variational objective. We demonstrate the efficacy of our approach in evaluations on diverse multimodal conditional synthesis tasks: dance-to-music generation, text-to-image synthesis, and class-conditioned image synthesis. On each, we enhance the input-output correspondence and achieve higher or competitive general synthesis quality. Furthermore, the proposed approach improves the convergence of diffusion models, reducing the number of required diffusion steps by more than 35% on two benchmarks and thereby significantly increasing inference speed.
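To make the core idea concrete, below is a minimal sketch (not the authors' implementation) of how an InfoNCE-style contrastive term, which lower-bounds the mutual information between the condition and the denoised output, can be added to a standard diffusion denoising loss. All names (`contrastive_diffusion_loss`, `temperature`, `lambda_c`) and the specific pairing scheme are illustrative assumptions, not taken from the paper.

```python
# Hedged sketch: InfoNCE-style contrastive term added to a diffusion
# denoising loss. Each denoised latent is pulled toward its own conditioning
# embedding; the other conditions in the batch serve as negatives.
import torch
import torch.nn.functional as F

def contrastive_diffusion_loss(z_denoised, cond_emb, temperature=0.07):
    """InfoNCE over a batch of (denoised latent, condition) pairs.

    z_denoised: (B, ...) model outputs at some denoising step (assumed shape)
    cond_emb:   (B, ...) embeddings of the conditioning inputs
    """
    z = F.normalize(z_denoised.flatten(1), dim=-1)   # (B, D)
    c = F.normalize(cond_emb.flatten(1), dim=-1)     # (B, D)
    logits = z @ c.t() / temperature                 # (B, B) similarity matrix
    labels = torch.arange(z.size(0), device=z.device)
    return F.cross_entropy(logits, labels)           # positives on the diagonal

def total_loss(denoise_loss, z_denoised, cond_emb, lambda_c=0.1):
    # Conventional variational (denoising) objective plus the contrastive
    # term, weighted by an assumed hyperparameter lambda_c.
    return denoise_loss + lambda_c * contrastive_diffusion_loss(z_denoised, cond_emb)
```

In this sketch, maximizing the diagonal similarity while suppressing off-diagonal pairs is what encourages high correspondence between each conditioning input and its generated output; the weighting against the variational term is one plausible way to connect the two objectives.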