We present an extension to masked autoencoders (MAE) that improves the representations learnt by the model by explicitly encouraging the learning of higher scene-level features. We do this by: (i) introducing a perceptual similarity term between generated and real images; and (ii) incorporating several techniques from the adversarial training literature, including multi-scale training and adaptive discriminator augmentation. The combination of these yields not only better pixel reconstruction but also representations that appear to better capture higher-level details within images. More consequentially, we show how our method, Perceptual MAE, leads to better performance on downstream tasks, outperforming previous approaches. We achieve 78.1% top-1 accuracy with linear probing on ImageNet-1K, and up to 88.1% when fine-tuning, with similar results on other downstream tasks, all without the use of additional pre-trained models or data.
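As a rough illustration of the kind of perceptual similarity term described above, the sketch below compares an MAE reconstruction and the original image in the feature space of a frozen VGG16 network. This is a minimal, hedged example, not the paper's exact formulation: the choice of backbone, feature layer, and the weighting coefficient `lambda_p` are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

class PerceptualLoss(torch.nn.Module):
    """Feature-space distance between generated and real images.

    Hypothetical sketch: layer cut-off and backbone are assumptions,
    not the formulation used in Perceptual MAE.
    """

    def __init__(self, layer_index: int = 16):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
        # Keep features up to an intermediate conv block and freeze them.
        self.features = vgg.features[:layer_index].eval()
        for p in self.features.parameters():
            p.requires_grad = False

    def forward(self, reconstruction: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # Compare activations of the reconstruction and the real image.
        return F.mse_loss(self.features(reconstruction), self.features(target))

def total_loss(reconstruction: torch.Tensor,
               target: torch.Tensor,
               perceptual: PerceptualLoss,
               lambda_p: float = 0.1) -> torch.Tensor:
    """Standard MAE pixel loss plus the perceptual term.

    lambda_p is a hypothetical weighting coefficient chosen for illustration.
    """
    pixel_loss = F.mse_loss(reconstruction, target)
    return pixel_loss + lambda_p * perceptual(reconstruction, target)
```

In this sketch, the frozen feature extractor penalizes reconstructions that match the target pixel-wise but miss texture and structure, which is one way to encourage the scene-level features the abstract refers to.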