MST: 用于视觉演示的蒙面自操作变换器 (MST: Masked Self-Supervised Transformer for Visual Representation)

Transformer has been widely used for self-supervised pre-training in Natural Language Processing (NLP) and achieved great success. However, it has not been fully explored in visual self-supervised learning. Meanwhile, previous methods only consider the high-level feature and learning representation from a global perspective, which may fail to transfer to the downstream dense prediction tasks focusing on local features. In this paper, we present a novel Masked Self-supervised Transformer approach named MST, which can explicitly capture the local context of an image while preserving the global semantic information. Specifically, inspired by the Masked Language Modeling (MLM) in NLP, we propose a masked token strategy based on the multi-head self-attention map, which dynamically masks some tokens of local patches without damaging the crucial structure for self-supervised learning. More importantly, the masked tokens together with the remaining tokens are further recovered by a global image decoder, which preserves the spatial information of the image and is more friendly to the downstream dense prediction tasks. The experiments on multiple datasets demonstrate the effectiveness and generality of the proposed method. For instance, MST achieves Top-1 accuracy of 76.9% with DeiT-S only using 300-epoch pre-training by linear evaluation, which outperforms supervised methods with the same epoch by 0.4% and its comparable variant DINO by 1.0\%. For dense prediction tasks, MST also achieves 42.7% mAP on MS COCO object detection and 74.04% mIoU on Cityscapes segmentation only with 100-epoch pre-training.

翻译：在自然语言处理(NLP)中,自我监督的变压器被广泛用于自然语言处理(NLP)的自我监督前培训,并取得了巨大成功。然而,在视觉自我监督学习中,并没有对它进行充分探讨。与此同时,以往的方法只是从全球角度考虑高层次特征和学习代表性,这可能无法转移到下游密集的预测任务,侧重于本地特征。在本文中,我们展示了名为MST的新颖的蒙面自我监督变压器方法,它可以明确捕捉图像的当地背景,同时保存全球语义信息。具体地,在NLP的蒙面语言建模(MLMM)的启发下,我们提出了一个基于多头自我关注地图的蒙面象征性战略。多头自我关注地图以动态方式遮掩一些本地补符号,而不会损害以本地特征为主的学习关键结构。更重要的是,全球图像解压器进一步恢复了图像的空间信息,并且更有利于下游密集的预测任务。在多层语言模型上,我们提出了一种代号-ST-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-S-