In this study, we propose Mixed and Masked Image Modeling (MixMIM), a simple but efficient masked image modeling (MIM) method that is applicable to various hierarchical Vision Transformers. Existing MIM methods replace a random subset of input tokens with a special MASK symbol and aim to reconstruct the original image tokens from the corrupted image. However, we find that using the MASK symbol greatly slows down training and causes an inconsistency between pretraining and finetuning, because of the large masking ratio (e.g., 40% in BEiT). In contrast, we replace the masked tokens of one image with visible tokens of another image, i.e., we create a mixed image. We then conduct dual reconstruction to recover the two original images from the mixed input, which significantly improves efficiency. While MixMIM can be applied to various architectures, this paper explores a simpler but stronger hierarchical Transformer and scales it to MixMIM-B, -L, and -H. Empirical results demonstrate that MixMIM can learn high-quality visual representations efficiently. Notably, MixMIM-B with 88M parameters achieves 85.1% top-1 accuracy on ImageNet-1K after pretraining for 600 epochs, setting a new record among MIM methods for neural networks of comparable size (e.g., ViT-B). In addition, its transfer performance on 6 other datasets shows that MixMIM has a better FLOPs/performance tradeoff than previous MIM methods. Code is available at https://github.com/Sense-X/MixMIM.
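The mixing and dual-reconstruction idea described above can be illustrated with a minimal NumPy sketch. This is not the released implementation: the function names `mix_tokens` and `dual_reconstruction_loss`, the default `mask_ratio` of 0.5, and the mean-squared-error objective are all assumptions made for illustration, and the encoder/decoder that would produce the predictions is omitted.

```python
import numpy as np


def mix_tokens(tokens_a, tokens_b, mask_ratio=0.5, rng=None):
    """Replace a random subset of image A's tokens with image B's tokens.

    tokens_a, tokens_b: arrays of shape (num_tokens, dim).
    Returns (mixed, mask), where mask[i] is True if position i holds a
    token taken from B (i.e. A's token is hidden at that position).
    mask_ratio is a hypothetical default, assumed to lie strictly in (0, 1).
    """
    rng = np.random.default_rng() if rng is None else rng
    num_tokens = tokens_a.shape[0]
    num_masked = int(round(mask_ratio * num_tokens))
    mask = np.zeros(num_tokens, dtype=bool)
    mask[rng.choice(num_tokens, size=num_masked, replace=False)] = True
    # Broadcast the per-token mask over the feature dimension.
    mixed = np.where(mask[:, None], tokens_b, tokens_a)
    return mixed, mask


def dual_reconstruction_loss(pred_a, pred_b, tokens_a, tokens_b, mask):
    """Assumed MSE objective: each image is scored only where it was hidden.

    A is hidden where mask is True; B is hidden where mask is False.
    """
    loss_a = np.mean((pred_a[mask] - tokens_a[mask]) ** 2)
    loss_b = np.mean((pred_b[~mask] - tokens_b[~mask]) ** 2)
    return loss_a + loss_b
```

In this sketch every position of the mixed input carries a real image token, so no MASK symbol is needed, and a single forward pass would provide reconstruction targets for both images.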