Masked image modeling (MIM) learns visual representations by masking image patches and reconstructing them. Applying the reconstruction supervision to CLIP representations has proven effective for MIM. However, how CLIP supervision in MIM influences performance remains under-explored. To investigate strategies for refining CLIP-targeted MIM, we study two critical elements of MIM, the supervision position and the mask ratio, and report two findings, using a simple pipeline we developed, the context autoencoder with CLIP target (CAE v2). First, supervision on visible patches achieves remarkable performance, even better than supervision on masked patches, which is the standard choice in existing MIM methods. Second, the optimal mask ratio correlates positively with model size: the smaller the model, the lower the mask ratio needs to be. Driven by these two findings, our simple and concise approach, CAE v2, achieves superior performance on a series of downstream tasks. For example, a vanilla ViT-Large model achieves 81.7% top-1 accuracy with linear probing and 86.7% with fine-tuning on ImageNet-1K, and 55.9% mIoU on semantic segmentation on ADE20K, after pre-training for 300 epochs. We hope our findings can serve as helpful guidelines for pre-training in the MIM area, especially for small-scale models.
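To make the two studied elements concrete, the following is a minimal sketch of how the supervision position and the mask ratio might enter a CLIP-targeted MIM loss. It is not the CAE v2 implementation: the function names (`random_mask`, `cosine_distance`, `clip_target_loss`), the cosine-distance loss, the toy feature dimension, and the mask ratio of 0.5 are all assumptions chosen for illustration, and random vectors stand in for the CLIP features and the model predictions.

```python
import math
import random


def random_mask(num_patches, mask_ratio, seed=0):
    """Return a list of booleans; True marks a masked patch (hypothetical helper)."""
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    masked_idx = set(rng.sample(range(num_patches), num_masked))
    return [i in masked_idx for i in range(num_patches)]


def cosine_distance(u, v):
    """1 - cosine similarity between two feature vectors (assumed loss form)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + 1e-8)


def clip_target_loss(pred, target, mask, position="both"):
    """Average per-patch distance over the chosen supervision position.

    position is "masked", "visible", or "both"; it selects which patches
    contribute to the loss.
    """
    if position == "masked":
        keep = list(mask)
    elif position == "visible":
        keep = [not m for m in mask]
    elif position == "both":
        keep = [True] * len(mask)
    else:
        raise ValueError("position must be 'masked', 'visible', or 'both'")
    losses = [cosine_distance(p, t) for p, t, k in zip(pred, target, keep) if k]
    return sum(losses) / len(losses) if losses else 0.0


# Toy usage: 196 patches (a 14 x 14 grid), 8-dimensional stand-in features.
num_patches, dim = 196, 8
mask = random_mask(num_patches, mask_ratio=0.5)  # illustrative ratio only
rng = random.Random(1)
target = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
pred = [[t + 0.1 * rng.gauss(0, 1) for t in row] for row in target]
for position in ("masked", "visible", "both"):
    print(position, clip_target_loss(pred, target, mask, position))
```

In this sketch, switching `position` from `"masked"` to `"visible"` changes only which patches are supervised, and `mask_ratio` is the knob that, according to the second finding, should be set lower for smaller models.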