The excellent generative capabilities of text-to-image diffusion models suggest they learn informative representations of image-text data. However, what knowledge their representations capture is not fully understood, and they have not been thoroughly explored on downstream tasks. We investigate diffusion models by proposing a method for evaluating them as zero-shot classifiers. The key idea is using a diffusion model's ability to denoise a noised image given a text description of a label as a proxy for that label's likelihood. We apply our method to Imagen, using it to probe fine-grained aspects of Imagen's knowledge and comparing it with CLIP's zero-shot abilities. Imagen performs competitively with CLIP on a wide range of zero-shot image classification datasets. Additionally, it achieves state-of-the-art results on shape/texture bias tests and can successfully perform attribute binding while CLIP cannot. Although generative pre-training is prevalent in NLP, visual foundation models often use other methods such as contrastive learning. Based on our findings, we argue that generative pre-training should be explored as a compelling alternative for vision and vision-language problems.
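As a rough illustration of the key idea, the sketch below scores each candidate label by how well the diffusion model denoises a noised copy of the image when conditioned on that label's text, then picks the label with the lowest error. The `denoise_fn` and `encode_text` interfaces, the cosine noise schedule, and the simple mean-squared scoring are assumptions made for illustration; they are not Imagen's actual API or the paper's exact scoring procedure.

```python
import numpy as np

def diffusion_zero_shot_classify(image, class_texts, denoise_fn, encode_text,
                                 n_samples=16, rng=None):
    """Zero-shot classification via denoising error (a minimal sketch).

    Hypothetical interfaces (assumed, not from the paper):
      encode_text(text)                 -> conditioning embedding
      denoise_fn(noisy_image, t, emb)   -> predicted noise (epsilon-hat)
    """
    rng = np.random.default_rng(0) if rng is None else rng
    errors = []
    for text in class_texts:
        emb = encode_text(text)
        total = 0.0
        for _ in range(n_samples):
            t = rng.uniform(0.0, 1.0)                  # random diffusion time in [0, 1]
            eps = rng.standard_normal(image.shape)     # Gaussian noise to add
            alpha = np.cos(0.5 * np.pi * t) ** 2       # assumed cosine schedule
            noisy = np.sqrt(alpha) * image + np.sqrt(1.0 - alpha) * eps
            eps_hat = denoise_fn(noisy, t, emb)        # denoise conditioned on the label text
            total += np.mean((eps_hat - eps) ** 2)     # denoising error for this sample
        errors.append(total / n_samples)               # lower error = text explains image better
    return int(np.argmin(errors))                      # index of the best-matching label
```

In practice, sharing the same noise draws and timesteps across all candidate labels reduces the variance of the comparison; the snippet above omits that detail for brevity.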