Diffusion-based generative models have achieved remarkable success in image generation. Their guidance formulation allows an external model to control the generation process for various tasks in a plug-and-play manner, without fine-tuning the diffusion model. However, directly using publicly available off-the-shelf models for guidance fails because they perform poorly on noisy inputs. The existing practice is therefore to fine-tune guidance models on labeled data corrupted with noise. In this paper, we argue that this practice has two limitations: (1) handling inputs across an extremely wide range of noise levels is too hard for a single model; (2) collecting labeled datasets hinders scaling to various tasks. To tackle these limitations, we propose a novel strategy that leverages multiple experts, where each expert specializes in a particular noise range and guides the reverse process at its corresponding timesteps. However, as it is infeasible to manage multiple networks and utilize labeled data, we present a practical guidance framework, termed Practical Plug-And-Play (PPAP), which leverages parameter-efficient fine-tuning and data-free knowledge transfer. We conduct extensive ImageNet class-conditional generation experiments to show that our method can successfully guide diffusion with a small number of trainable parameters and no labeled data. Finally, we show that image classifiers, depth estimators, and semantic segmentation models can guide the publicly available GLIDE model through our framework in a plug-and-play manner.
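The core scheduling idea in the multi-expert strategy can be sketched as follows: the reverse process's timesteps are partitioned into contiguous noise ranges, and each range is assigned its own guidance expert. This is a minimal illustrative sketch, not the paper's implementation; the names `select_expert` and `experts`, and the equal-width partition, are assumptions for illustration.

```python
# Minimal sketch of multi-expert guidance scheduling: T reverse-process
# timesteps are split into K contiguous, equal-width noise ranges, and
# each range is handled by a dedicated expert. In PPAP these experts
# would be parameter-efficient adaptations of one off-the-shelf model;
# here they are stand-in labels.

def select_expert(experts, t, T):
    """Pick the expert whose noise range covers timestep t (0 <= t < T)."""
    K = len(experts)
    idx = min(t * K // T, K - 1)  # map t into one of K equal-width ranges
    return experts[idx]

# Toy usage: 3 experts over T = 1000 timesteps (small t = low noise here).
experts = ["expert_low_noise", "expert_mid_noise", "expert_high_noise"]
T = 1000
print(select_expert(experts, 0, T))    # expert_low_noise
print(select_expert(experts, 500, T))  # expert_mid_noise
print(select_expert(experts, 999, T))  # expert_high_noise
```

At sampling time, each denoising step would query `select_expert` for the expert matching the current timestep and use its gradient (e.g., of a class log-probability) to steer the reverse update, instead of asking a single model to cope with every noise level.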