We study semi-supervised learning (SSL) for vision transformers (ViT), an under-explored topic despite the wide adoption of ViT architectures across different tasks. To tackle this problem, we propose a new SSL pipeline consisting of un/self-supervised pre-training, followed by supervised fine-tuning, and finally semi-supervised fine-tuning. At the semi-supervised fine-tuning stage, we adopt an exponential moving average (EMA)-Teacher framework instead of the popular FixMatch, since the former is more stable and delivers higher accuracy for semi-supervised vision transformers. In addition, we propose a probabilistic pseudo mixup mechanism to interpolate unlabeled samples and their pseudo labels for improved regularization, which is important for training ViTs with weak inductive bias. Our proposed method, dubbed Semi-ViT, achieves performance comparable to or better than its CNN counterparts in the semi-supervised classification setting. Semi-ViT also enjoys the scalability benefits of ViTs and can be readily scaled up to large models with increasing accuracy. For example, Semi-ViT-Huge achieves an impressive 80% top-1 accuracy on ImageNet using only 1% of the labels, which is comparable to Inception-v4 trained with 100% of the ImageNet labels.
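To make the two components named above concrete, the sketch below illustrates an EMA teacher update and mixup applied to unlabeled samples with their teacher-generated pseudo labels. This is a minimal illustration rather than the authors' implementation: the momentum value, the Beta parameter `alpha`, and the confidence-based loss weighting are assumptions introduced here for exposition, and the paper's exact probabilistic formulation may differ.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Exponential moving average of student weights into the teacher.

    `momentum` is an illustrative value, not one taken from the paper.
    """
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.data.mul_(momentum).add_(s.data, alpha=1.0 - momentum)


def pseudo_mixup_step(student, teacher, unlabeled, alpha=0.8):
    """One illustrative semi-supervised step on a batch of unlabeled images.

    The teacher produces soft pseudo labels; images and pseudo labels are then
    interpolated with standard mixup, and each mixed sample's loss is weighted
    by its pseudo-label confidence (one plausible reading of "probabilistic").
    """
    with torch.no_grad():
        probs = F.softmax(teacher(unlabeled), dim=-1)   # soft pseudo labels
        conf, _ = probs.max(dim=-1)                     # per-sample confidence

    # Standard mixup recipe over the unlabeled batch and its pseudo labels.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(unlabeled.size(0), device=unlabeled.device)
    mixed_x = lam * unlabeled + (1.0 - lam) * unlabeled[perm]
    mixed_y = lam * probs + (1.0 - lam) * probs[perm]
    weight = lam * conf + (1.0 - lam) * conf[perm]

    # Soft cross-entropy of the student on the mixed inputs, confidence-weighted.
    logits = student(mixed_x)
    loss = -(mixed_y * F.log_softmax(logits, dim=-1)).sum(dim=-1)
    return (weight * loss).mean()
```

In a training loop under these assumptions, `pseudo_mixup_step` would be called on each unlabeled batch (alongside the supervised loss on labeled data), followed by `ema_update(teacher, student)` after the optimizer step.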