With the development of generative-based self-supervised learning (SSL) approaches like BeiT and MAE, learning good representations by masking random patches of the input image and reconstructing the missing information has attracted increasing attention. However, BeiT and PeCo need a "pre-pretraining" stage to produce discrete codebooks for representing masked patches. MAE does not require a pre-training codebook process, but setting pixels as reconstruction targets may introduce an optimization gap between pre-training and downstream tasks, in that good reconstruction quality may not always lead to high descriptive capability for the model. Considering the above issues, in this paper, we propose a simple Self-distillated masked AutoEncoder network, namely SdAE. SdAE consists of a student branch using an encoder-decoder structure to reconstruct the missing information, and a teacher branch producing latent representations of masked tokens. We also analyze how to build good views for the teacher branch to produce latent representations, from the perspective of the information bottleneck. After that, we propose a multi-fold masking strategy that provides multiple masked views with balanced information to boost performance, which can also reduce the computational complexity. Our approach generalizes well: with only 300 epochs of pre-training, a vanilla ViT-Base model achieves 84.1% fine-tuning accuracy on ImageNet-1k classification, 48.6 mIoU on ADE20K segmentation, and 48.9 mAP on COCO detection, surpassing other methods by a considerable margin. Code is available at https://github.com/AbrahamYabo/SdAE.
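To make the multi-fold masking idea concrete, the following is a minimal sketch (not the authors' code; see the repository above for the actual implementation). It assumes the patch indices are shuffled and partitioned into k disjoint folds, with one fold kept visible for the student encoder and the remaining folds serving as masked views for the teacher branch; the fold count k and all helper names are illustrative assumptions.

```python
import torch

def multi_fold_mask(num_patches: int, k: int = 4, generator=None):
    """Partition shuffled patch indices into k disjoint folds.

    Returns (visible_idx, masked_folds): one fold of visible patch indices
    for the student encoder, and k-1 disjoint folds of masked patch indices
    whose teacher representations the student decoder could be trained to match.
    """
    perm = torch.randperm(num_patches, generator=generator)
    fold_size = num_patches // k
    folds = [perm[i * fold_size:(i + 1) * fold_size] for i in range(k)]
    return folds[0], folds[1:]

# Example: a 14x14 ViT-Base patch grid (196 patches) split into 4 folds of 49,
# i.e. 25% of patches visible and three balanced masked views.
visible, masked_views = multi_fold_mask(196, k=4)
print(visible.shape, [m.shape for m in masked_views])
```

Because the folds are disjoint and equally sized, each masked view carries a balanced share of the image's information, and the teacher only needs to encode each fold once rather than the full token set, which is the source of the computational savings claimed above.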