This paper investigates two techniques for developing efficient self-supervised vision transformers (EsViT) for visual representation learning. First, we show through a comprehensive empirical study that multi-stage architectures with sparse self-attention can significantly reduce modeling complexity, but at the cost of losing the ability to capture fine-grained correspondences between image regions. Second, we propose a new pre-training task of region matching, which allows the model to capture fine-grained region dependencies and, as a result, significantly improves the quality of the learned vision representations. Our results show that, combining the two techniques, EsViT achieves 81.3% top-1 accuracy on the ImageNet linear probe evaluation, outperforming prior art with around an order of magnitude higher throughput. When transferring to downstream linear classification tasks, EsViT outperforms its supervised counterpart on 17 out of 18 datasets. The code and models are publicly available: https://github.com/microsoft/esvit