There has been a longstanding belief that generation can facilitate a true understanding of visual data. In line with this, we revisit generatively pre-training visual representations in light of recent interest in denoising diffusion models. While directly pre-training with diffusion models does not produce strong representations, we condition diffusion models on masked input, formulating them as masked autoencoders (DiffMAE). Our approach is capable of (i) serving as a strong initialization for downstream recognition tasks, (ii) conducting high-quality image inpainting, and (iii) being effortlessly extended to video, where it produces state-of-the-art classification accuracy. We further perform a comprehensive study on the pros and cons of design choices and build connections between diffusion models and masked autoencoders.
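For concreteness, the sketch below shows one way such masked-conditioned denoising could be wired up in PyTorch. It is a minimal illustration under stated assumptions, not the paper's implementation: `TinyDiffMAE`, its toy linear noise schedule, and all dimensions are hypothetical stand-ins, and the real model operates on image pixel patches with a full DDPM noise schedule.

```python
import torch
import torch.nn as nn


class TinyDiffMAE(nn.Module):
    """Toy DiffMAE-style model: an encoder over visible patches only, and a
    decoder that denoises the masked patches conditioned on the encoder output."""

    def __init__(self, dim=64, nhead=4, steps=1000):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(dim, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.decoder = nn.TransformerEncoder(layer(), num_layers=2)
        self.t_embed = nn.Embedding(steps, dim)  # diffusion timestep embedding
        self.head = nn.Linear(dim, dim)          # predicts clean masked-patch content
        self.steps = steps

    def forward(self, patches, mask, t):
        # patches: (B, N, dim) patch embeddings; mask: (B, N) bool, True = masked.
        # Assumes the same number of masked patches per sample (MAE-style ratio).
        B, N, D = patches.shape
        # Toy linear schedule standing in for a proper DDPM noise schedule.
        alpha = 1.0 - t.float().view(-1, 1, 1) / self.steps
        noised = alpha.sqrt() * patches + (1.0 - alpha).sqrt() * torch.randn_like(patches)
        ctx = self.encoder(patches[~mask].view(B, -1, D))  # encoder never sees masked content
        msk = noised[mask].view(B, -1, D) + self.t_embed(t).unsqueeze(1)
        out = self.decoder(torch.cat([ctx, msk], dim=1))   # denoise conditioned on context
        pred = self.head(out[:, ctx.size(1):])
        target = patches[mask].view(B, -1, D)
        return ((pred - target) ** 2).mean()  # regress toward the clean masked patches


model = TinyDiffMAE()
x = torch.randn(2, 196, 64)                  # 2 images, 14x14 patch tokens
mask = torch.zeros(2, 196, dtype=torch.bool)
mask[:, :147] = True                         # 75% masking ratio
loss = model(x, mask, torch.randint(0, 1000, (2,)))
loss.backward()
```

The design point mirrored here is what ties the two formulations together: as in a masked autoencoder, the encoder processes only visible patches, while the diffusion-style decoder denoises only the masked positions, conditioned on that visible context.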