We present an extension to masked autoencoders (MAE) which improves the representations learnt by the model by explicitly encouraging the learning of higher scene-level features. We do this by: (i) introducing a perceptual similarity term between generated and real images; and (ii) incorporating several techniques from the adversarial training literature, including multi-scale training and adaptive discriminator augmentation. The combination of these yields not only better pixel reconstruction but also representations which appear to better capture higher-level details within images. More consequentially, we show how our method, Perceptual MAE, leads to better performance on downstream tasks, outperforming previous methods. We achieve 78.1% top-1 accuracy with linear probing on ImageNet-1K and up to 88.1% when fine-tuning, with similar results for other downstream tasks, all without the use of additional pre-trained models or data.
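To make the perceptual similarity term concrete, the sketch below shows one plausible form: reconstructed and real images are passed through the same feature extractor and the distance between intermediate activations is penalised alongside the usual pixel loss. Since the abstract states that no additional pre-trained models are used, the extractor here is a small discriminator trained jointly with the MAE; its architecture, the layers compared, and the loss weight are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch of a perceptual (feature-matching) similarity term for MAE training.
# All architectural details below are assumptions made for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyDiscriminator(nn.Module):
    """Illustrative conv net whose intermediate features define the perceptual space."""

    def __init__(self, channels=(3, 64, 128, 256)):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=4, stride=2, padding=1),
                nn.LeakyReLU(0.2),
            )
            for c_in, c_out in zip(channels[:-1], channels[1:])
        )
        self.head = nn.Conv2d(channels[-1], 1, kernel_size=4)  # real/fake logit map

    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)
        return self.head(x), feats


def perceptual_loss(disc, reconstructed, real):
    """Mean L2 distance between discriminator features of reconstructed and real images."""
    _, feats_fake = disc(reconstructed)
    with torch.no_grad():  # real-image features act as fixed targets
        _, feats_real = disc(real)
    return sum(F.mse_loss(f, r) for f, r in zip(feats_fake, feats_real)) / len(feats_fake)


if __name__ == "__main__":
    disc = TinyDiscriminator()
    real = torch.rand(2, 3, 224, 224)
    reconstructed = torch.rand(2, 3, 224, 224)  # stand-in for MAE decoder output
    pixel_loss = F.mse_loss(reconstructed, real)
    total = pixel_loss + 0.1 * perceptual_loss(disc, reconstructed, real)  # weight is an assumption
    print(float(total))
```

In this sketch the discriminator would additionally be trained with an adversarial objective (and, per the abstract, multi-scale inputs and adaptive discriminator augmentation); only the reconstruction-side similarity term is shown here.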