Masked Autoencoders (MAE) have become a popular paradigm for large-scale vision representation pre-training. However, MAE solely reconstructs low-level RGB signals after the decoder and provides no supervision on high-level semantics for the encoder, resulting in sub-optimal learned representations and long pre-training schedules. To alleviate this, previous methods simply replace the pixel reconstruction targets of the 75% masked tokens with features encoded by pre-trained image-image (DINO) or image-language (CLIP) contrastive learning. Different from those efforts, we propose to Mimic before Reconstruct for Masked Autoencoders, named MR-MAE, which jointly learns high-level and low-level representations without interference during pre-training. For high-level semantics, MR-MAE applies a mimic loss over the 25% visible tokens from the encoder to capture the pre-trained patterns encoded in CLIP and DINO. For low-level structures, we inherit the reconstruction loss in MAE to predict RGB pixel values for the 75% masked tokens after the decoder. As MR-MAE applies high-level and low-level targets to different token partitions, the learning conflicts between the two objectives are naturally avoided, contributing to superior visual representations for various downstream tasks. On ImageNet-1K, MR-MAE base pre-trained for only 400 epochs achieves 85.8% top-1 accuracy after fine-tuning, surpassing the 1600-epoch MAE base by +2.2% and the previous state-of-the-art BEiT V2 base by +0.3%. Code and pre-trained models will be released at https://github.com/Alpha-VL/ConvMAE.
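The objective described above can be sketched as follows. This is a minimal NumPy illustration, not the released implementation: it assumes plain MSE for both terms, random per-sample masking as in MAE, and pre-computed teacher (CLIP/DINO) features; all function names and toy shapes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mask(num_tokens, mask_ratio=0.75, rng=rng):
    """MAE-style random split of patch-token indices into visible and masked sets."""
    perm = rng.permutation(num_tokens)
    num_masked = int(num_tokens * mask_ratio)
    return perm[num_masked:], perm[:num_masked]  # visible indices, masked indices

def mr_mae_loss(enc_feats, teacher_feats, dec_pixels, target_pixels,
                visible_idx, masked_idx):
    """Hypothetical combined objective:
    - mimic loss: encoder features on the 25% visible tokens match a
      frozen CLIP/DINO teacher (high-level target, before the decoder);
    - reconstruction loss: decoder predicts RGB values of the 75%
      masked tokens (low-level target, after the decoder)."""
    mimic = np.mean((enc_feats[visible_idx] - teacher_feats[visible_idx]) ** 2)
    recon = np.mean((dec_pixels[masked_idx] - target_pixels[masked_idx]) ** 2)
    return mimic + recon

# Toy shapes: 196 patch tokens (14x14 grid), 256-d features, 768-d pixel patches.
N, D, P = 196, 256, 768
vis, msk = random_mask(N)
loss = mr_mae_loss(rng.normal(size=(N, D)), rng.normal(size=(N, D)),
                   rng.normal(size=(N, P)), rng.normal(size=(N, P)), vis, msk)
```

Because the two losses act on disjoint token partitions, neither gradient signal overwrites the other, which is the interference-avoidance argument the abstract makes.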