This paper presents SimMIM, a simple framework for masked image modeling. We simplify recently proposed related approaches, requiring no special designs such as block-wise masking or tokenization via discrete VAE or clustering. To study what lets the masked image modeling task learn good representations, we systematically study the major components of our framework and find that simple designs for each component reveal very strong representation learning performance: 1) random masking of the input image with a moderately large masked patch size (e.g., 32) makes a strong pretext task; 2) predicting RGB values of raw pixels by direct regression performs no worse than patch classification approaches with complex designs; 3) the prediction head can be as light as a linear layer, with no worse performance than heavier ones. Using ViT-B, our approach achieves 83.8% top-1 fine-tuning accuracy on ImageNet-1K by also pre-training on this dataset, surpassing the previous best approach by +0.6%. When applied to a larger model of about 650 million parameters, SwinV2-H, it achieves 87.1% top-1 accuracy on ImageNet-1K using only ImageNet-1K data. We also leverage this approach to facilitate the training of a 3B model (SwinV2-G): with $40\times$ less data than in previous practice, we achieve state-of-the-art results on four representative vision benchmarks. The code and models will be publicly available at https://github.com/microsoft/SimMIM.
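To make the three design choices above concrete, here is a minimal PyTorch sketch of random patch masking, a linear prediction head, and direct pixel regression with an $\ell_1$ loss on masked pixels. This is an illustrative assumption of how the pieces fit together, not the authors' released implementation; names such as `random_mask`, `SimMIMHead`, and the 0.6 masking ratio are hypothetical defaults.

```python
import torch
import torch.nn as nn

def random_mask(batch, image_size=224, mask_patch=32, mask_ratio=0.6):
    """Randomly mask patches of size `mask_patch` x `mask_patch`.

    Returns a pixel-level 0/1 mask of shape (batch, image_size, image_size),
    where 1 marks masked pixels. The 0.6 ratio is an assumed default.
    """
    n = image_size // mask_patch                       # patches per side
    patch_mask = (torch.rand(batch, n, n) < mask_ratio).float()
    # Upsample the patch-level mask to pixel resolution.
    return patch_mask.repeat_interleave(mask_patch, dim=1) \
                     .repeat_interleave(mask_patch, dim=2)

class SimMIMHead(nn.Module):
    """Prediction head as light as a single linear layer: each encoder
    token is mapped directly to the RGB pixels of its patch."""
    def __init__(self, encoder_dim=768, patch=32):
        super().__init__()
        self.linear = nn.Linear(encoder_dim, 3 * patch * patch)

    def forward(self, tokens):        # tokens: (B, L, encoder_dim)
        return self.linear(tokens)    # (B, L, 3 * patch * patch)

def masked_l1_loss(pred, target, pixel_mask):
    """Direct regression of raw RGB values, averaged over masked pixels only.

    pred, target: (B, 3, H, W); pixel_mask: (B, H, W) from random_mask.
    """
    mask = pixel_mask.unsqueeze(1)                     # broadcast over channels
    l1 = (pred - target).abs()
    return (l1 * mask).sum() / (mask.sum() * 3 + 1e-8)
```

In this sketch any backbone (e.g., ViT-B) can supply the tokens; the head's output would be reshaped to image layout before the loss. The point of the example is how little machinery the framework needs: one random mask, one linear layer, one $\ell_1$ term on masked pixels.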