Inspired by recent advances in diffusion models, which are reminiscent of denoising autoencoders, we investigate whether they can acquire discriminative representations for classification via generative pre-training. This paper shows that the networks in diffusion models, namely denoising diffusion autoencoders (DDAE), are unified self-supervised learners: by pre-training on unconditional image generation, a DDAE has already learned strongly linearly separable representations within its intermediate layers without auxiliary encoders, thus making diffusion pre-training emerge as a general approach for self-supervised generative and discriminative learning. To verify this, we conduct linear-probe and fine-tuning evaluations on multi-class datasets. Our diffusion-based approach achieves 95.9% and 50.0% linear-probe accuracies on CIFAR-10 and Tiny-ImageNet, respectively, and is comparable, for the first time, to masked autoencoders and contrastive learning. Additionally, transfer learning from ImageNet confirms the DDAE's suitability for latent-space Vision Transformers, suggesting the potential for scaling DDAEs up as unified foundation models.
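For concreteness, the linear-probe protocol referenced above can be sketched as follows. This is a minimal PyTorch illustration under stated assumptions, not the paper's released code: the `unet(x_t, t)` forward signature, the hooked `layer`, the `alphas_cumprod` noise schedule, the probed timestep `t_probe`, and the data loader are all hypothetical placeholders, and the choice of layer and timestep is a hyperparameter the evaluation would search over.

```python
import torch
import torch.nn as nn

def q_sample(x0, t, noise, alphas_cumprod):
    """Standard DDPM forward process: x_t = sqrt(a_bar_t)*x0 + sqrt(1-a_bar_t)*eps."""
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

@torch.no_grad()
def ddae_features(unet, layer, x0, t, alphas_cumprod):
    """Noise clean images to timestep t, run the frozen denoiser, and read an
    intermediate activation via a forward hook, globally average-pooled to a vector."""
    captured = {}
    hook = layer.register_forward_hook(lambda m, i, o: captured.update(h=o))
    unet(q_sample(x0, t, torch.randn_like(x0), alphas_cumprod), t)
    hook.remove()
    return captured["h"].mean(dim=(2, 3))  # (B, C) pooled feature

def train_linear_probe(unet, layer, loader, alphas_cumprod,
                       feat_dim, num_classes, t_probe, epochs=10):
    """Fit a single linear layer on frozen DDAE features (the linear-probe metric)."""
    unet.eval()  # backbone stays frozen; only the probe is trained
    probe = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    for _ in range(epochs):
        for x, y in loader:
            t = torch.full((x.size(0),), t_probe, dtype=torch.long)
            h = ddae_features(unet, layer, x, t, alphas_cumprod)
            loss = nn.functional.cross_entropy(probe(h), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe
```

The design point this sketch captures is that no auxiliary encoder is introduced: the features come straight out of the generatively pre-trained denoising network at a chosen noise level, and only the final linear classifier is learned.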