We introduce Palette, a simple and general framework for image-to-image translation using conditional diffusion models. On four challenging image-to-image translation tasks (colorization, inpainting, uncropping, and JPEG decompression), Palette outperforms strong GAN and regression baselines and establishes a new state of the art. This is accomplished without task-specific hyper-parameter tuning, architecture customization, or any auxiliary loss, demonstrating a desirable degree of generality and flexibility. We uncover the impact of using $L_2$ vs. $L_1$ loss in the denoising diffusion objective on sample diversity, and demonstrate the importance of self-attention through empirical architecture studies. Importantly, we advocate a unified evaluation protocol based on ImageNet, and report several sample quality scores, including FID, Inception Score, Classification Accuracy of a pre-trained ResNet-50, and Perceptual Distance against reference images, for various baselines. We expect this standardized evaluation protocol to play a critical role in advancing image-to-image translation research. Finally, we show that a single generalist Palette model trained on three tasks (colorization, inpainting, JPEG decompression) performs as well as or better than task-specific specialist counterparts.
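To make the $L_2$ vs. $L_1$ choice in the denoising objective concrete, the following is a minimal sketch of a single training-step loss. It assumes a standard noise-prediction formulation (the model predicts the Gaussian noise added to the target image), which the abstract does not spell out. The names `denoising_loss`, `predict_noise`, `gamma`, and `p` are hypothetical and chosen only for illustration.

```python
import numpy as np


def denoising_loss(predict_noise, x, y0, gamma, rng, p=2):
    """Noise-prediction loss for one conditional diffusion training step.

    predict_noise: hypothetical callable (x, y_noisy, gamma) -> predicted noise
    x:     conditioning image (e.g. a grayscale or masked input)
    y0:    clean target image, same shape as the predicted noise
    gamma: assumed noise level in (0, 1); 1 means no noise
    p:     1 for the L1 objective, 2 for the L2 objective
    """
    if p not in (1, 2):
        raise ValueError("p must be 1 (L1) or 2 (L2)")
    # Sample Gaussian noise and corrupt the clean target with it.
    eps = rng.standard_normal(y0.shape)
    y_noisy = np.sqrt(gamma) * y0 + np.sqrt(1.0 - gamma) * eps
    # Compare the model's noise estimate with the true noise.
    residual = predict_noise(x, y_noisy, gamma) - eps
    # Mean absolute error (p=1) or mean squared error (p=2).
    return float(np.mean(np.abs(residual) ** p))
```

Switching `p` between 1 and 2 changes only the final reduction. Under this sketch the rest of the training step is identical, so any difference in sample diversity comes from the choice of norm alone.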