Vision Transformer (ViT) suffers from data scarcity in semi-supervised learning (SSL). To alleviate this issue, and inspired by the masked autoencoder (MAE), a data-efficient self-supervised learner, we propose Semi-MAE, a pure ViT-based SSL framework with a parallel MAE branch that assists visual representation learning and makes the pseudo-labels more accurate. The MAE branch is an asymmetric architecture composed of a lightweight decoder and an encoder whose weights are shared with the main branch. We feed weakly-augmented unlabeled data with a high masking ratio to the MAE branch and reconstruct the missing pixels. Semi-MAE achieves 75.9% top-1 accuracy on ImageNet with 10% labels, surpassing the prior state of the art in semi-supervised image classification. In addition, extensive experiments demonstrate that Semi-MAE readily transfers to other ViT models and masked image modeling methods.
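The MAE branch described above can be sketched numerically: mask a large fraction of patch tokens, encode only the visible ones, and reconstruct the masked positions, with the loss computed on masked patches only. This is a minimal NumPy sketch of that objective; `W_enc` and `W_dec` are placeholder linear maps standing in for the shared ViT encoder and lightweight decoder, not the actual Semi-MAE modules, and the patch/embedding sizes assume a ViT-B/16-style tokenization.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mask(num_patches, mask_ratio=0.75):
    """Split patch indices into visible and masked sets (MAE-style high masking ratio)."""
    num_visible = int(num_patches * (1 - mask_ratio))
    perm = rng.permutation(num_patches)
    return perm[:num_visible], perm[num_visible:]

# Toy "image": 196 patch tokens of dimension 768 (224x224 image, 16x16 patches).
patches = rng.standard_normal((196, 768))

visible_idx, masked_idx = random_mask(196, mask_ratio=0.75)

# Placeholder encoder/decoder weights (hypothetical, not the real modules).
W_enc = rng.standard_normal((768, 768)) * 0.01
W_dec = rng.standard_normal((768, 768)) * 0.01
mask_token = np.zeros(768)

# Asymmetry: the encoder processes only the 25% visible patches.
latent = patches[visible_idx] @ W_enc

# The decoder sees encoded visible tokens plus mask tokens at masked positions.
dec_in = np.concatenate([latent, np.tile(mask_token, (len(masked_idx), 1))])
dec_out = dec_in @ W_dec

# Reconstruction loss is taken only on the masked patches.
pred_masked = dec_out[len(visible_idx):]
loss = np.mean((pred_masked - patches[masked_idx]) ** 2)
```

Because only a quarter of the tokens pass through the encoder, the auxiliary branch adds little compute on top of the main SSL pipeline, which is what makes the parallel-branch design practical.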