We present multimodal conditioning modules (MCM) for enabling conditional image synthesis using pretrained diffusion models. Previous multimodal synthesis works rely on training networks from scratch or fine-tuning pretrained networks, both of which are computationally expensive for large, state-of-the-art diffusion models. Our method uses pretrained networks but does not require any updates to the diffusion network's parameters. MCM is a small module trained to modulate the diffusion network's predictions during sampling using 2D modalities (e.g., semantic segmentation maps, sketches) that were unseen during the original training of the diffusion model. We show that MCM gives users control over the spatial layout of the generated image and increases control over the image generation process. Training MCM is cheap: it does not require gradients from the original diffusion network, comprises only $\sim$1$\%$ of the base diffusion model's parameters, and requires only a limited number of training examples. We evaluate our method on unconditional and text-conditional models to demonstrate the improved control over the generated images and their alignment with respect to the conditioning inputs.
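To make the setup concrete, the sketch below illustrates one way a small conditioning module could modulate a frozen diffusion model's noise prediction during a DDPM-style training step, with gradients flowing only through the small module. This is a minimal illustration under assumed interfaces, not the authors' exact architecture; the names (`SmallModulator`, `base_model`, `cond`, `alphas_cumprod`) and the scale-and-shift form of the modulation are assumptions for exposition.

```python
# Minimal sketch (assumed interfaces, not the paper's exact architecture):
# a tiny conditioning module that maps (noisy image, 2D condition) to a
# per-pixel scale and shift applied to a *frozen* base model's prediction.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallModulator(nn.Module):
    """Tiny conv net: (x_t, condition) -> per-pixel (scale, shift)."""
    def __init__(self, img_channels=3, cond_channels=3, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(img_channels + cond_channels, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, 2 * img_channels, 3, padding=1),
        )
        # Zero-init the last layer so the module starts as the identity
        # modulation (scale ~ 1, shift ~ 0) and cannot hurt the base model early on.
        nn.init.zeros_(self.net[-1].weight)
        nn.init.zeros_(self.net[-1].bias)

    def forward(self, x_t, cond):
        scale, shift = self.net(torch.cat([x_t, cond], dim=1)).chunk(2, dim=1)
        return 1.0 + scale, shift


def training_step(base_model, modulator, x0, cond, alphas_cumprod, optimizer):
    """One DDPM-style step; only the small modulator receives gradients.
    `base_model(x_t, t)` is an assumed epsilon-prediction interface."""
    b = x0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=x0.device)
    noise = torch.randn_like(x0)
    a = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise      # forward diffusion

    with torch.no_grad():                              # frozen base model, forward pass only
        eps_base = base_model(x_t, t)

    scale, shift = modulator(x_t, cond)                # small trainable module
    eps_mod = scale * eps_base + shift                 # modulated prediction

    loss = F.mse_loss(eps_mod, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the base model is wrapped in `torch.no_grad()`, training never backpropagates through it, which is consistent with the abstract's claim that no gradients from the original diffusion network are needed; only the few parameters of the modulator are updated.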