We are witnessing a modeling shift from CNNs to Transformers in computer vision. In this work, we present a self-supervised learning approach called MoBY, with Vision Transformers as its backbone architecture. The approach contains essentially no new inventions; it combines MoCo v2 and BYOL and is tuned to achieve reasonably high accuracy on ImageNet-1K linear evaluation: 72.8% and 75.0% top-1 accuracy with DeiT-S and Swin-T, respectively, after 300 epochs of training. The performance is slightly better than recent works such as MoCo v3 and DINO, which adopt DeiT as the backbone, but with much lighter tricks. More importantly, the general-purpose Swin Transformer backbone enables us to also evaluate the learnt representations on downstream tasks such as object detection and semantic segmentation, in contrast to some recent approaches built on ViT/DeiT, which report only linear evaluation results on ImageNet-1K because ViT/DeiT has not been tamed for these dense prediction tasks. We hope our results can facilitate more comprehensive evaluation of self-supervised learning methods designed for Transformer architectures. Our code and models are available at https://github.com/SwinTransformer/Transformer-SSL, and will be continually enriched.
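To make the "MoCo v2 + BYOL" combination concrete, the following is a minimal PyTorch sketch of how such a hybrid can be wired together: an online branch with a projector and a BYOL-style predictor, a momentum-updated target branch, and a MoCo-v2-style contrastive loss over a queue of past keys. This is not the authors' exact implementation (see the repository above for that); the class name, MLP shapes, queue length, temperature, and momentum values here are illustrative assumptions only.

```python
# Hedged sketch of a MoCo-v2 + BYOL hybrid; hyperparameters and names are assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


def mlp(in_dim, hidden_dim=4096, out_dim=256):
    # Simple 2-layer projection/prediction head (assumed structure).
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim),
    )


class MoBYSketch(nn.Module):
    """Rough sketch: BYOL-style asymmetry (predictor + momentum target branch)
    combined with a MoCo-v2-style contrastive loss using a key queue."""

    def __init__(self, backbone, feat_dim, out_dim=256, queue_len=4096, tau=0.2, momentum=0.99):
        super().__init__()
        self.tau, self.m = tau, momentum

        # Online branch: backbone -> projector -> predictor (gradients flow here).
        self.online_backbone = backbone
        self.online_projector = mlp(feat_dim, out_dim=out_dim)
        self.predictor = mlp(out_dim, out_dim=out_dim)

        # Target branch: momentum copy of backbone + projector, no gradients.
        self.target_backbone = copy.deepcopy(backbone)
        self.target_projector = copy.deepcopy(self.online_projector)
        for p in list(self.target_backbone.parameters()) + list(self.target_projector.parameters()):
            p.requires_grad = False

        # Queue of past keys used as negatives (MoCo-v2 style).
        self.register_buffer("queue", F.normalize(torch.randn(queue_len, out_dim), dim=1))
        self.register_buffer("queue_ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _momentum_update(self):
        # Exponential-moving-average update of the target branch.
        for po, pt in zip(self.online_backbone.parameters(), self.target_backbone.parameters()):
            pt.data = self.m * pt.data + (1.0 - self.m) * po.data
        for po, pt in zip(self.online_projector.parameters(), self.target_projector.parameters()):
            pt.data = self.m * pt.data + (1.0 - self.m) * po.data

    @torch.no_grad()
    def _enqueue(self, keys):
        # Assumes queue_len is a multiple of the number of enqueued keys.
        n, ptr = keys.shape[0], int(self.queue_ptr)
        self.queue[ptr:ptr + n] = keys
        self.queue_ptr[0] = (ptr + n) % self.queue.shape[0]

    def contrastive_loss(self, q, k):
        q, k = F.normalize(q, dim=1), F.normalize(k, dim=1)
        l_pos = (q * k).sum(dim=1, keepdim=True)          # positive logits: (N, 1)
        l_neg = q @ self.queue.clone().detach().t()       # negative logits: (N, K)
        logits = torch.cat([l_pos, l_neg], dim=1) / self.tau
        labels = torch.zeros(logits.shape[0], dtype=torch.long, device=q.device)
        return F.cross_entropy(logits, labels), k

    def forward(self, view1, view2):
        # Queries come from the online branch (with predictor), keys from the target branch.
        q1 = self.predictor(self.online_projector(self.online_backbone(view1)))
        q2 = self.predictor(self.online_projector(self.online_backbone(view2)))
        with torch.no_grad():
            self._momentum_update()
            k1 = self.target_projector(self.target_backbone(view1))
            k2 = self.target_projector(self.target_backbone(view2))
        # Symmetric loss over the two augmented views.
        loss1, k2n = self.contrastive_loss(q1, k2)
        loss2, k1n = self.contrastive_loss(q2, k1)
        self._enqueue(torch.cat([k1n, k2n], dim=0))
        return loss1 + loss2
```

In this sketch the backbone can be any feature extractor (e.g., a Swin-T or DeiT-S trunk producing a pooled feature vector); swapping backbones is exactly what allows the learnt representations to be evaluated on dense prediction tasks afterwards.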