For unsupervised pretraining, mask-reconstruction pretraining (MRP) approaches, e.g. MAE and data2vec, randomly mask input patches and then reconstruct the pixels or semantic features of these masked patches via an auto-encoder. For a downstream task, supervised fine-tuning of the pretrained encoder remarkably surpasses conventional ``supervised learning'' (SL) trained from scratch. However, it remains unclear 1) how MRP performs semantic feature learning in the pretraining phase and 2) why it helps downstream tasks. To answer these questions, we first theoretically show that, on an auto-encoder with a two-layer convolutional encoder and a one-layer convolutional decoder, MRP can capture all discriminative features of each potential semantic class in the pretraining dataset. Then, since the pretraining dataset is large and diverse and thus covers most features of the downstream dataset, in the fine-tuning phase the pretrained encoder captures as many features as possible in the downstream dataset and, with theoretical guarantees, does not lose these features. In contrast, SL only randomly captures some features due to the lottery ticket hypothesis. Hence MRP provably achieves better performance than SL on classification tasks. Experimental results support our data assumptions and our theoretical implications.
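To make the pretraining procedure concrete, below is a minimal sketch of MRP under the two-layer-encoder/one-layer-decoder setting analyzed above; the patch size, masking ratio, channel widths, and loss on masked pixels only are illustrative MAE-style assumptions, not the paper's exact configuration.

```python
# Minimal sketch of mask-reconstruction pretraining (MRP), assuming a
# two-layer convolutional encoder and a one-layer convolutional decoder.
# Patch size, masking ratio, and channel widths are illustrative choices.
import torch
import torch.nn as nn

patch = 16          # assumed patch size
mask_ratio = 0.75   # assumed masking ratio (MAE-style)

encoder = nn.Sequential(            # two-layer convolutional encoder
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
)
decoder = nn.Conv2d(64, 3, kernel_size=3, padding=1)  # one-layer convolutional decoder

def mrp_loss(images: torch.Tensor) -> torch.Tensor:
    """Mask random patches, reconstruct them, and score only the masked pixels."""
    b, c, h, w = images.shape
    # Build a per-patch binary keep-mask and upsample it to pixel resolution.
    ph, pw = h // patch, w // patch
    keep = (torch.rand(b, 1, ph, pw, device=images.device) > mask_ratio).float()
    pixel_mask = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    # Zero out the masked patches before encoding, then reconstruct all pixels.
    recon = decoder(encoder(images * pixel_mask))
    # Reconstruction loss is computed on the masked patches only.
    masked = 1.0 - pixel_mask
    return ((recon - images) ** 2 * masked).sum() / masked.sum().clamp(min=1.0)

if __name__ == "__main__":
    opt = torch.optim.AdamW(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)
    batch = torch.randn(8, 3, 64, 64)   # stand-in for a pretraining batch
    loss = mrp_loss(batch)
    loss.backward()
    opt.step()
    print(float(loss))
```

After such pretraining, the decoder would be discarded and the encoder fine-tuned with supervised labels on the downstream classification task, which is the comparison against SL discussed above.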