For unsupervised pretraining, mask-reconstruction pretraining (MRP) approaches randomly mask input patches and then reconstruct the pixels or semantic features of these masked patches via an auto-encoder. Then for a downstream task, supervised fine-tuning of the pretrained encoder remarkably surpasses conventional supervised learning (SL) trained from scratch. However, it is still unclear 1) how MRP performs semantic learning in the pretraining phase and 2) why it helps downstream tasks. To answer these questions, we theoretically show that, on an auto-encoder consisting of a two-layer convolution encoder and a one-layer convolution decoder, MRP can capture all discriminative semantics in the pretraining dataset, and accordingly we show its provable improvement over SL on the classification downstream task. Specifically, we assume that the pretraining dataset contains multi-view samples of ratio $1-\mu$ and single-view samples of ratio $\mu$, where multi/single-view samples have multiple/single discriminative semantics. Then for pretraining, we prove that 1) the convolution kernels of the MRP encoder capture all discriminative semantics in the pretraining data, and 2) each convolution kernel captures at most one semantic. Accordingly, in downstream supervised fine-tuning, most semantics are captured and different semantics are not fused together. This helps the fine-tuned network easily establish the relation between kernels and semantic class labels. In this way, the fine-tuned encoder in MRP provably achieves zero test error with high probability on both multi-view and single-view test data. In contrast, as proved by~[3], conventional SL can only obtain a test accuracy of around $0.5$ on single-view test data. These results together explain the benefits of MRP in downstream tasks. Experimental results testify to our multi-view data assumption and theoretical implications.
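To make the pretraining setup concrete, the following is a minimal PyTorch sketch of mask-reconstruction pretraining with a two-layer convolution encoder and a one-layer convolution decoder. It only illustrates the mechanism described above; the kernel sizes, masking ratio, and loss normalization are illustrative choices, not the exact construction analyzed in the proofs.

```python
# Illustrative sketch (not the paper's exact construction): mask random patches
# of the input, encode with a two-layer convolution encoder, and reconstruct the
# masked pixels with a one-layer (transposed) convolution decoder.
import torch
import torch.nn as nn


class TwoLayerConvEncoder(nn.Module):
    def __init__(self, in_ch=3, hidden=64, out_ch=128):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, hidden, kernel_size=4, stride=4)  # patch-wise conv
        self.conv2 = nn.Conv2d(hidden, out_ch, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.conv2(self.act(self.conv1(x))))


class OneLayerConvDecoder(nn.Module):
    def __init__(self, in_ch=128, out_ch=3):
        super().__init__()
        # a single transposed convolution maps encoder features back to pixel space
        self.deconv = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=4)

    def forward(self, z):
        return self.deconv(z)


def random_patch_mask(x, patch=4, mask_ratio=0.75):
    """Zero out a random subset of non-overlapping patches of each image."""
    b, _, h, w = x.shape
    keep = (torch.rand(b, 1, h // patch, w // patch, device=x.device) > mask_ratio).float()
    mask = keep.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return x * mask, 1.0 - mask  # masked input, binary mask of hidden pixels


def mrp_pretrain_step(encoder, decoder, x, optimizer):
    """One MRP step: reconstruct the pixels of the masked patches only."""
    x_masked, hidden = random_patch_mask(x)
    recon = decoder(encoder(x_masked))
    loss = ((recon - x) ** 2 * hidden).sum() / hidden.sum().clamp(min=1.0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


# Hypothetical usage:
# enc, dec = TwoLayerConvEncoder(), OneLayerConvDecoder()
# opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
# loss = mrp_pretrain_step(enc, dec, torch.randn(8, 3, 32, 32), opt)
```

After pretraining, the decoder is discarded and the encoder, with a classification head on top, is fine-tuned with labels; this is the downstream setting compared against SL trained from scratch.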
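The multi-view data assumption can likewise be sketched as a toy sampler: a $1-\mu$ fraction of samples contain several class-discriminative semantics ("views"), while a $\mu$ fraction contain only one. The vector-sum construction, noise level, and parameter names below are hypothetical simplifications for illustration, not the exact data distribution used in the analysis.

```python
# Hypothetical toy sampler for the multi-view / single-view mixture:
# with probability mu a sample carries a single semantic of its class,
# otherwise it carries all of them.
import numpy as np


def sample_dataset(n, mu=0.2, num_classes=2, views_per_class=2, dim=32, noise=0.1, seed=0):
    rng = np.random.default_rng(seed)
    # one (roughly orthogonal) random unit direction per (class, view) pair
    semantics = rng.standard_normal((num_classes, views_per_class, dim))
    semantics /= np.linalg.norm(semantics, axis=-1, keepdims=True)

    xs, ys, is_single = [], [], []
    for _ in range(n):
        y = int(rng.integers(num_classes))
        single = rng.random() < mu
        if single:
            views = [int(rng.integers(views_per_class))]  # single-view sample
        else:
            views = list(range(views_per_class))          # multi-view sample
        x = sum(semantics[y, v] for v in views) + noise * rng.standard_normal(dim)
        xs.append(x)
        ys.append(y)
        is_single.append(single)
    return np.stack(xs), np.array(ys), np.array(is_single)
```

Intuitively, under such a mixture a model that captures only one semantic per class still classifies multi-view samples but fails on the single-view samples missing that semantic, which is the gap contrasted above with MRP's zero test error.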