The success of language Transformers is primarily attributed to the pretext task of masked language modeling (MLM), where texts are first tokenized into semantically meaningful pieces. In this work, we study masked image modeling (MIM) and indicate the advantages and challenges of using a semantically meaningful visual tokenizer. We present a self-supervised framework iBOT that can perform masked prediction with an online tokenizer. Specifically, we perform self-distillation on masked patch tokens and take the teacher network as the online tokenizer, along with self-distillation on the class token to acquire visual semantics. The online tokenizer is jointly learnable with the MIM objective and dispenses with a multi-stage training pipeline where the tokenizer needs to be pre-trained beforehand. We show the prominence of iBOT by achieving an 82.3% linear probing accuracy and an 87.8% fine-tuning accuracy evaluated on ImageNet-1K. Beyond the state-of-the-art image classification results, we underline emerging local semantic patterns, which help the models obtain strong robustness against common corruptions and achieve leading results on dense downstream tasks, e.g., object detection, instance segmentation, and semantic segmentation.
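To make the training objective concrete, the following is a minimal sketch of the two self-distillation losses described above: a MIM loss on masked patch tokens, where the teacher network serves as the online tokenizer, and a cross-view loss on the class token. It assumes a ViT-style backbone whose forward pass returns a class token and patch tokens projected onto K prototype dimensions; the function and parameter names (student, teacher, tau_s, tau_t) are illustrative assumptions, not the paper's official implementation, and details such as output centering and the EMA teacher update are omitted.

```python
import torch
import torch.nn.functional as F


def distillation_loss(student_logits, teacher_logits, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between a sharpened teacher distribution and the student.

    Temperatures tau_s and tau_t are assumed values for illustration.
    """
    teacher_probs = F.softmax(teacher_logits / tau_t, dim=-1).detach()
    student_logprobs = F.log_softmax(student_logits / tau_s, dim=-1)
    return -(teacher_probs * student_logprobs).sum(dim=-1).mean()


def ibot_losses(student, teacher, view1, view2, mask):
    """Sketch of the iBOT objectives under the stated assumptions.

    view1, view2: two augmented views of the same images, (B, C, H, W)
    mask:         boolean (B, N) patch mask applied to the student's input
    student/teacher: callables returning (cls_logits, patch_logits), where
                     the teacher is an EMA copy of the student (the online
                     tokenizer) and receives unmasked inputs.
    """
    s_cls1, s_patch1 = student(view1, mask=mask)   # student sees masked patches
    with torch.no_grad():
        _, t_patch1 = teacher(view1)               # teacher tokenizes the full view
        t_cls2, _ = teacher(view2)                 # teacher class token from the other view

    # MIM loss: recover the teacher's token distribution at masked positions.
    loss_mim = distillation_loss(s_patch1[mask], t_patch1[mask])
    # Class-token loss: cross-view self-distillation for image-level semantics.
    loss_cls = distillation_loss(s_cls1, t_cls2)
    return loss_mim, loss_cls
```

Because the teacher is updated online (e.g., as an exponential moving average of the student) rather than pre-trained, the tokenizer and the MIM objective are learned jointly in a single stage, as stated in the abstract.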