We introduce Corrupted Image Modeling (CIM) for self-supervised visual pre-training. Instead of inserting artificial mask tokens, CIM corrupts the input image with an auxiliary generator built on a small trainable BEiT: some patches are randomly selected and replaced with plausible alternatives sampled from the BEiT output distribution. Given this corrupted image, an enhancer network learns either to recover all the original image pixels or to predict whether each visual token has been replaced by a generator sample. The generator and the enhancer are trained simultaneously and updated synergistically. After pre-training, the enhancer can be used as a high-capacity visual encoder for downstream tasks. CIM is a general and flexible visual pre-training framework that suits various network architectures. For the first time, CIM demonstrates that both ViT and CNN can learn rich visual representations using a unified, non-Siamese framework. Experiments show that our approach achieves compelling results on vision benchmarks such as ImageNet classification and ADE20K semantic segmentation. For example, 300-epoch CIM pre-trained vanilla ViT-Base/16 and ResNet-50 obtain 83.3% and 80.6% Top-1 fine-tuning accuracy, respectively, on ImageNet-1K image classification.
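The corruption step and the replaced-token prediction target described above can be illustrated with a toy sketch. This is not the authors' implementation: it stands in for an image with a flat array of integer visual-token ids, stands in for the BEiT generator with a fixed per-position probability table, and assumes an ELECTRA-style labelling convention in which a position counts as "replaced" only if the sampled token differs from the original. The names `corrupt_tokens`, `generator_probs`, and `mask_ratio` are hypothetical.

```python
import numpy as np


def corrupt_tokens(tokens, generator_probs, mask_ratio, rng):
    """Replace a random subset of tokens with samples from a generator.

    tokens:          (N,) int array of original visual-token ids.
    generator_probs: (N, V) array; row i is the (assumed) generator output
                     distribution over a vocabulary of size V at position i.
    mask_ratio:      fraction of positions selected for corruption.
    Returns (corrupted, replaced, positions), where `replaced` is a 0/1
    array marking positions whose token now differs from the original.
    """
    n = tokens.shape[0]
    vocab_size = generator_probs.shape[1]
    num_selected = int(round(mask_ratio * n))

    # Randomly select which positions to corrupt (no repeats).
    positions = rng.choice(n, size=num_selected, replace=False)

    corrupted = tokens.copy()
    for p in positions:
        # Sample a plausible alternative from the generator distribution.
        corrupted[p] = rng.choice(vocab_size, p=generator_probs[p])

    # Assumption: a sample that equals the original token is labelled
    # "not replaced", as in ELECTRA-style replaced-token detection.
    replaced = (corrupted != tokens).astype(np.int64)
    return corrupted, replaced, positions


rng = np.random.default_rng(0)
tokens = np.arange(16) % 8              # toy "image" of 16 token ids
probs = np.full((16, 8), 1.0 / 8.0)     # uniform stand-in for the generator
corrupted, replaced, positions = corrupt_tokens(tokens, probs, 0.5, rng)
```

In this sketch, `replaced` plays the role of the target for an enhancer trained to predict which visual tokens came from the generator; the pixel-recovery variant would instead regress the original image from the corrupted input.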