Masked image modeling (MIM) has attracted much research attention due to its promising potential for learning scalable visual representations. In typical approaches, models focus on predicting the specific contents of masked patches, and their performance is highly dependent on pre-defined mask strategies. Intuitively, this procedure can be viewed as training a student (the model) to solve given problems (predicting masked patches). However, we argue that the model should not only solve given problems, but also stand in the shoes of a teacher and produce more challenging problems by itself. To this end, we propose Hard Patches Mining (HPM), a brand-new framework for MIM pre-training. We observe that the reconstruction loss can naturally serve as a metric of the difficulty of the pre-training task. Therefore, we introduce an auxiliary loss predictor, which first predicts patch-wise losses and then decides where to mask next. It adopts a relative relationship learning strategy to prevent overfitting to exact reconstruction-loss values. Experiments under various settings demonstrate the effectiveness of HPM in constructing masked images. Furthermore, we empirically find that solely introducing the loss prediction objective already leads to powerful representations, verifying the efficacy of being aware of which patches are hard to reconstruct.
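The two ingredients described above, masking the patches predicted to be hardest and training the loss predictor with a relative (order-based) objective rather than regressing exact loss values, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names (`hard_patch_mask`, `relative_loss`) and the pairwise binary cross-entropy formulation of "relative relationship learning" are assumptions made for clarity.

```python
# Minimal sketch of the HPM idea (hypothetical names, not the official code).
# Assumed setup: an auxiliary head produces per-patch predicted losses of shape (B, N),
# and the MIM branch produces per-patch ground-truth reconstruction losses of the same shape.
import torch
import torch.nn.functional as F

def hard_patch_mask(pred_losses: torch.Tensor, mask_ratio: float = 0.75) -> torch.Tensor:
    """Mask the patches predicted to be hardest to reconstruct."""
    B, N = pred_losses.shape
    num_mask = int(N * mask_ratio)
    # Indices with the highest predicted losses become the masked positions.
    idx = pred_losses.topk(num_mask, dim=1).indices
    mask = torch.zeros(B, N, dtype=torch.bool, device=pred_losses.device)
    mask.scatter_(1, idx, True)
    return mask

def relative_loss(pred_losses: torch.Tensor, true_losses: torch.Tensor) -> torch.Tensor:
    """One plausible relative-relationship objective: supervise only the pairwise
    ordering of patch losses, so the predictor need not fit exact loss values."""
    true_losses = true_losses.detach()  # no gradient back into the reconstruction branch
    # Pairwise differences for predictions (logits) and targets, shape (B, N, N).
    dp = pred_losses.unsqueeze(2) - pred_losses.unsqueeze(1)
    dt = true_losses.unsqueeze(2) - true_losses.unsqueeze(1)
    target = (dt > 0).float()
    return F.binary_cross_entropy_with_logits(dp, target)
```

In this reading, the ranking-style objective reflects the abstract's motivation: what matters for choosing where to mask is which patches are harder than others, and regressing exact reconstruction-loss values would risk overfitting to a quantity that keeps changing as the model improves.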