We study the training of Vision Transformers for semi-supervised image classification. Transformers have recently demonstrated impressive performance on a multitude of supervised learning tasks. Surprisingly, we find that Vision Transformers perform poorly in a semi-supervised ImageNet setting. In contrast, Convolutional Neural Networks (CNNs) achieve superior results in the small labeled-data regime. Further investigation reveals that the reason is that CNNs have a strong spatial inductive bias. Inspired by this observation, we introduce a joint semi-supervised learning framework, Semiformer, which contains a Transformer branch, a Convolutional branch, and a carefully designed fusion module for knowledge sharing between the branches. The Convolutional branch is trained on the limited supervised data and generates pseudo-labels to supervise the training of the Transformer branch on unlabeled data. Extensive experiments on ImageNet demonstrate that Semiformer achieves 75.5\% top-1 accuracy, outperforming the state of the art. In addition, we show that Semiformer is a general framework compatible with most modern Transformer and Convolutional neural architectures.
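The pseudo-labeling step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `pseudo_label` function, its confidence `threshold`, and the use of hard (argmax) labels are illustrative assumptions about how the Convolutional branch's predictions on unlabeled images could be filtered before supervising the Transformer branch.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the class dimension.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def pseudo_label(cnn_logits, threshold=0.7):
    """Turn CNN-branch logits on unlabeled images into hard pseudo-labels.

    Only predictions whose top-class probability exceeds `threshold` are
    kept; the threshold value is an illustrative choice, not the paper's.
    Returns the kept labels and a boolean mask over the batch.
    """
    probs = softmax(cnn_logits)
    conf = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    mask = conf >= threshold
    return labels[mask], mask
```

The retained pseudo-labels would then serve as classification targets for the Transformer branch on the corresponding unlabeled images, alongside the standard supervised loss on the labeled subset.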