Diffusion-based generative models have achieved remarkable success in image generation. Their guidance formulation allows an external model to control the generation process in a plug-and-play manner for various tasks without finetuning the diffusion model. However, directly using publicly available off-the-shelf models for guidance fails due to their poor performance on noisy inputs. The existing remedy is to fine-tune the guidance models on labeled data corrupted with noise. In this paper, we argue that this practice has two limitations: (1) handling inputs spanning widely varying noise levels is too difficult for a single guidance model; (2) collecting labeled datasets hinders scaling to various tasks. To tackle these limitations, we propose a novel strategy that leverages multiple experts, where each expert specializes in a particular noise range and guides the reverse diffusion process at its corresponding timesteps. However, since managing multiple networks and relying on labeled data is impractical, we present a practical guidance framework termed Practical Plug-And-Play (PPAP), which leverages parameter-efficient fine-tuning and data-free knowledge transfer. We conduct extensive ImageNet class-conditional generation experiments to show that our method can successfully guide diffusion with few trainable parameters and no labeled data. Finally, we show that image classifiers, depth estimators, and semantic segmentation models can guide publicly available GLIDE through our framework in a plug-and-play manner. Our code is available at https://github.com/riiid/PPAP.
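To make the "guidance formulation" and the multi-expert strategy concrete, the sketch below shows the standard classifier-guidance update on the reverse-process mean (as in Dhariwal & Nichol), with the only modification assumed here being that the guidance model is indexed by the noise range of the current timestep; the expert-assignment function $k(t)$ and the per-expert parameters $\phi_{k(t)}$ are illustrative notation, not the paper's exact formulation.

\[
\tilde{\mu}_\theta(x_t, t) \;=\; \mu_\theta(x_t, t) \;+\; s\, \Sigma_\theta(x_t, t)\, \nabla_{x_t} \log p_{\phi_{k(t)}}\!\left(y \mid x_t\right),
\]

where $\mu_\theta$ and $\Sigma_\theta$ are the mean and covariance predicted by the frozen diffusion model, $s$ is the guidance scale, and $k(t)$ maps timestep $t$ to the expert responsible for its noise range, so each expert only ever sees inputs within the noise levels it was adapted to.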