We introduce Corrupted Image Modeling (CIM) for self-supervised visual pre-training. Instead of using artificial [MASK] tokens, CIM uses an auxiliary generator with a small trainable BEiT to corrupt the input image: some patches are randomly selected and replaced with plausible alternatives sampled from the BEiT output distribution. Given this corrupted image, an enhancer network learns either to recover all the original image pixels or to predict whether each visual token was replaced by a generator sample. The generator and the enhancer are trained simultaneously and updated synergistically. After pre-training, the enhancer serves as a high-capacity visual encoder for downstream tasks. CIM is a general and flexible visual pre-training framework suitable for various network architectures. For the first time, CIM demonstrates that both ViT and CNN can learn rich visual representations within a unified, non-Siamese framework. Experimental results show that our approach achieves compelling results on vision benchmarks such as ImageNet classification and ADE20K semantic segmentation.
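The corruption step described above can be sketched in a toy form. This is a hedged illustration, not the paper's implementation: `vocab_size`, `mask_ratio`, and the uniform sampling stand in for the small BEiT's learned output distribution over visual tokens, which the real method would use instead.

```python
import random

random.seed(0)

num_patches = 16   # e.g. a 4x4 grid of image patches (toy size)
vocab_size = 8     # toy visual-token vocabulary
mask_ratio = 0.5   # fraction of patches to corrupt

# Original visual tokens for each patch (what a tokenizer would emit).
original_tokens = [random.randrange(vocab_size) for _ in range(num_patches)]

# Randomly select patch positions to corrupt.
corrupt_idx = set(random.sample(range(num_patches),
                                int(mask_ratio * num_patches)))

# The auxiliary generator proposes plausible alternatives; here we sample
# uniformly as a stand-in for the small BEiT's output distribution.
corrupted_tokens = [
    random.randrange(vocab_size) if i in corrupt_idx else t
    for i, t in enumerate(original_tokens)
]

# Replaced-token-detection target for the enhancer: 1 where the token
# actually differs from the original (a generator sample may coincide
# with the true token, in which case the position counts as unreplaced).
is_replaced = [int(c != o) for c, o in zip(corrupted_tokens, original_tokens)]
```

The enhancer would consume the image reconstructed from `corrupted_tokens` and be supervised either by the original pixels (generative objective) or by `is_replaced` (discriminative objective), matching the two pre-training targets the abstract mentions.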