Recently, self-supervised vision transformers have attracted unprecedented attention for their impressive representation learning ability. However, the dominant method, contrastive learning, mainly relies on an instance discrimination pretext task, which learns a global understanding of the image. This paper incorporates local feature learning into self-supervised vision transformers via Reconstructive Pre-training (RePre). Our RePre extends contrastive frameworks by adding a branch that reconstructs raw image pixels in parallel with the existing contrastive objective. RePre is equipped with a lightweight convolution-based decoder that fuses the multi-hierarchy features from the transformer encoder. These multi-hierarchy features provide rich supervision ranging from low-level to high-level semantic information, which is crucial for RePre. RePre brings consistent improvements across various contrastive frameworks with different vision transformer architectures. Transfer performance on downstream tasks surpasses that of supervised pre-training and state-of-the-art (SOTA) self-supervised counterparts.
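To make the two-branch design concrete, below is a minimal PyTorch sketch of the idea described above: a standard contrastive objective plus a parallel pixel-reconstruction branch driven by a lightweight convolutional decoder that fuses multi-hierarchy encoder features. The names `encoder`, `return_stages`, `contrastive_loss`, and the exact decoder layout are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of RePre-style training, assuming a ViT-like encoder that
# can return intermediate (multi-hierarchy) token maps as (B, C_i, H, W) tensors.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LightweightConvDecoder(nn.Module):
    """Fuses multi-stage transformer features and reconstructs raw pixels."""

    def __init__(self, dims, patch_size=16, img_channels=3):
        super().__init__()
        # Project each stage's feature map to a common width before fusion.
        self.projs = nn.ModuleList([nn.Conv2d(d, 256, kernel_size=1) for d in dims])
        self.fuse = nn.Sequential(
            nn.Conv2d(256 * len(dims), 256, kernel_size=3, padding=1),
            nn.GELU(),
            # Upsample the token grid back to pixel resolution.
            nn.ConvTranspose2d(256, img_channels, kernel_size=patch_size, stride=patch_size),
        )

    def forward(self, stage_feats):
        # stage_feats: list of (B, C_i, H, W) feature maps from different encoder depths.
        fused = torch.cat([p(f) for p, f in zip(self.projs, stage_feats)], dim=1)
        return self.fuse(fused)


def repre_loss(encoder, decoder, images, contrastive_loss, recon_weight=1.0):
    """Contrastive objective plus a parallel pixel-reconstruction branch."""
    # Assumed interface: the encoder returns a global representation and a list
    # of intermediate stage features when asked.
    z, stage_feats = encoder(images, return_stages=True)
    recon = decoder(stage_feats)
    loss_rec = F.l1_loss(recon, images)   # reconstruct raw image pixels
    loss_con = contrastive_loss(z)        # existing instance-discrimination objective
    return loss_con + recon_weight * loss_rec
```

The reconstruction term only adds a small decoder on top of features the encoder already computes, so the contrastive pipeline itself is left unchanged.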