The recent wave of large-scale text-to-image diffusion models has dramatically increased our text-based image generation abilities. These models can generate realistic images for a staggering variety of prompts and exhibit impressive compositional generalization abilities. Almost all use cases thus far have solely focused on sampling; however, diffusion models can also provide conditional density estimates, which are useful for tasks beyond image generation. In this paper, we show that the density estimates from large-scale text-to-image diffusion models like Stable Diffusion can be leveraged to perform zero-shot classification without any additional training. Our generative approach to classification attains strong results on a variety of benchmarks and outperforms alternative methods of extracting knowledge from diffusion models. We also find that our diffusion-based approach has stronger multimodal relational reasoning abilities than competing contrastive approaches. Finally, we evaluate diffusion models trained on ImageNet and find that they approach the performance of SOTA discriminative classifiers trained on the same dataset, even with weak augmentations and no regularization. Results and visualizations at https://diffusion-classifier.github.io/
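To make the classification mechanism concrete, here is a minimal sketch of the idea the abstract describes: the diffusion ELBO for p(x | c) can be approximated (up to constants) by the expected noise-prediction error, so zero-shot classification reduces to picking the class prompt whose conditioning yields the lowest average error. In the sketch below, `unet`, `encode_image`, `encode_text`, and `alphas_cumprod` are hypothetical stand-ins for the components of a text-conditioned latent diffusion model and the standard DDPM noise schedule; this is an illustration of the technique, not the authors' released implementation.

```python
import torch

# Hypothetical components (placeholders, not a specific library's API):
#   encode_image(x)      -> clean latent z0
#   encode_text(prompt)  -> text conditioning c
#   unet(z_t, t, c)      -> predicted noise eps_hat
#   alphas_cumprod       -> tensor of cumulative alpha-bar values, one per timestep

@torch.no_grad()
def diffusion_classify(x, class_prompts, unet, encode_image, encode_text,
                       alphas_cumprod, n_trials=64):
    """Zero-shot classification via conditional denoising error.

    For each candidate class c, Monte Carlo-estimate
        E_{t, eps} || eps - eps_hat(z_t, t, c) ||^2,
    a proxy for the (negative) ELBO of log p(x | c), and return the
    class whose conditioning gives the lowest average error.
    """
    z0 = encode_image(x)                                   # clean latent
    T = alphas_cumprod.shape[0]
    # Share timesteps and noise samples across classes so the per-class
    # error estimates differ only through the conditioning.
    ts = torch.randint(0, T, (n_trials,))
    eps = torch.randn((n_trials, *z0.shape))

    errors = []
    for prompt in class_prompts:
        c = encode_text(prompt)
        err = 0.0
        for t, e in zip(ts, eps):
            a_bar = alphas_cumprod[t]
            z_t = a_bar.sqrt() * z0 + (1 - a_bar).sqrt() * e   # forward noising
            eps_hat = unet(z_t, t, c)                           # denoiser prediction
            err += torch.mean((e - eps_hat) ** 2).item()
        errors.append(err / n_trials)

    return int(torch.tensor(errors).argmin())              # predicted class index
```

Sharing the same (t, eps) pairs across all classes is a variance-reduction choice: the argmin then depends only on how well each conditioning explains the identical noised latents, which keeps the number of denoiser evaluations per class small.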