Vision Transformers have proven highly effective on computer vision tasks due to their ability to model long-range feature dependencies. Trained on large-scale data with various self-supervised signals (e.g., randomly masked patches), vision transformers deliver state-of-the-art performance on several benchmark datasets, such as ImageNet-1k and CIFAR-10. However, vision transformers pretrained on general large-scale image corpora produce an anisotropic representation space, limiting their generalizability and transferability to target downstream tasks. In this paper, we propose a simple and effective Label-aware Contrastive Training framework, LaCViT, which improves the isotropy of the pretrained representation space of vision transformers, thereby enabling more effective transfer learning across a wide range of image classification tasks. Through experiments on five standard image classification datasets, we demonstrate that LaCViT-trained models outperform their original pretrained baselines by around 9% absolute Accuracy@1, with consistent improvements observed across all three vision transformers we evaluate.
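The abstract does not spell out the training objective, so the following is only an illustrative sketch of a generic label-aware (supervised) contrastive loss of the kind the framework's name suggests: embeddings of samples sharing a class label are pulled together on the unit hypersphere while all other samples are pushed apart. The function name, argument shapes, and temperature value here are hypothetical and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def label_aware_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised (label-aware) contrastive loss over one batch.

    embeddings: (N, D) features, e.g. a projection of the [CLS] token.
    labels:     (N,)  integer class labels.
    Note: hypothetical sketch, not the exact LaCViT objective.
    """
    z = F.normalize(embeddings, dim=1)          # project onto the unit hypersphere
    sim = z @ z.t() / temperature               # pairwise cosine similarity / tau

    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask

    # Softmax denominator runs over all other samples (anchor itself excluded).
    sim = sim.masked_fill(self_mask, float("-inf"))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    # Average log-probability of the positives, for anchors that have at least one.
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0
    pos_log_prob = log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)
    return -(pos_log_prob[valid] / pos_counts[valid]).mean()

# Illustrative usage (names are placeholders):
#   feats = vit(images)[:, 0]                       # e.g. [CLS] token features
#   loss = label_aware_contrastive_loss(feats, labels)
```

Relative to a cross-entropy fine-tuning objective, a loss of this form directly shapes the geometry of the representation space, which is consistent with the abstract's goal of improving isotropy before transfer.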