Masked image modeling (MIM) has been recognized as a strong and popular self-supervised pre-training approach in the vision domain. However, the mechanism and the properties of the representations learned by such a scheme, as well as their interpretability, remain so far under-explored. In this work, through comprehensive experiments and empirical studies on Masked Autoencoders (MAE), we address two critical questions about the behaviors of the learned representations: (i) Are the latent representations in Masked Autoencoders linearly separable when the input is a mixture of two images instead of one? This can serve as concrete evidence to explain why MAE-learned representations achieve superior performance on downstream tasks, as demonstrated impressively throughout the literature. (ii) What degree of semantics is encoded in the latent feature space by Masked Autoencoders? To explore these two questions, we propose a simple yet effective Interpretable MAE (i-MAE) framework with a two-way image reconstruction and a latent feature reconstruction with a distillation loss, which helps us understand the behaviors inside MAE's structure. Extensive experiments are conducted on the CIFAR-10/100, Tiny-ImageNet, and ImageNet-1K datasets to verify our observations. Furthermore, in addition to qualitatively analyzing the characteristics of the latent representations, we examine the existence of linear separability and the degree of semantics in the latent space by proposing two novel metrics. The surprising and consistent results across the qualitative and quantitative experiments demonstrate that i-MAE is a superior framework design for interpretability research on MAE, while also achieving better representational ability. Code is available at https://github.com/vision-learning-acceleration-lab/i-mae.