Inspired by masked language modeling (MLM) in natural language processing, masked image modeling (MIM) has become a strong and popular self-supervised pre-training method in computer vision. However, its high random mask ratio causes two serious problems: 1) the data are not efficiently exploited, leading to inefficient pre-training (\eg, 1600 epochs for MAE $vs.$ 300 epochs for supervised training), and 2) high uncertainty and inconsistency in the pre-trained model, \ie, the prediction of the same patch may be inconsistent under different mask rounds. To tackle these problems, we propose efficient masked autoencoders with self-consistency (EMAE), which improve the pre-training efficiency and increase the consistency of MIM. In particular, we progressively divide the image into K non-overlapping parts, each of which is generated by a random mask and has the same mask ratio. The MIM task is then conducted on all parts in parallel within an iteration and generates predictions. Besides, we design a self-consistency module to further maintain the consistency of the predictions of overlapping masked patches among parts. Overall, the proposed method exploits the data more efficiently and obtains reliable representations. Experiments on ImageNet show that EMAE, with only 300 pre-training epochs under ViT-Base, achieves even higher results than MAE pre-trained for 1600 epochs. EMAE also consistently obtains state-of-the-art transfer performance on various downstream tasks, such as object detection and semantic segmentation.
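To make the partitioning scheme concrete, the sketch below shows one way to split the patch indices of an image into K non-overlapping visible sets with an equal mask ratio of (K-1)/K, so that the K parallel MIM passes together cover every patch. This is a minimal illustration under assumed function and variable names (\eg, \texttt{partition\_patches}), not the authors' released implementation.

\begin{verbatim}
import numpy as np

def partition_patches(num_patches: int, K: int, seed: int = 0):
    """Randomly partition patch indices into K non-overlapping visible sets.

    With a mask ratio of (K-1)/K, each part keeps num_patches // K visible
    patches, and the K parts jointly cover all patches exactly once.
    (Illustrative sketch only; names and bookkeeping are assumptions.)
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_patches)   # one random shuffle of all patch indices
    part_size = num_patches // K
    return [perm[i * part_size:(i + 1) * part_size] for i in range(K)]

# Example: a 14x14 ViT patch grid (196 patches) split into K=4 parts,
# i.e. a 75% mask ratio for each part, as in MAE.
parts = partition_patches(num_patches=196, K=4)
assert sorted(np.concatenate(parts).tolist()) == list(range(196))
for k, visible in enumerate(parts):
    print(f"part {k}: {len(visible)} visible patches")
\end{verbatim}

Each part's visible patches would then be fed to the encoder, and the predictions for a masked patch produced by different parts can be compared by the self-consistency module.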