Pre-training a large transformer model on a massive amount of unlabeled data and fine-tuning it on labeled datasets for diverse downstream tasks has proven to be a successful strategy for a variety of vision and natural language processing tasks. However, directly fine-tuning the pre-trained model may be suboptimal when there is a large discrepancy between the pre-training and fine-tuning data domains. To tackle this issue, several previous studies have proposed further pre-training strategies, in which the model is continually pre-trained on the target unlabeled dataset before fine-tuning. However, all of them focus solely on language models, and we empirically find that a Vision Transformer is vulnerable to overfitting when we continue to pre-train it on target unlabeled data. To address this limitation, we propose self-distillation as a regularization for the further pre-training stage. Specifically, we first further pre-train the initial pre-trained model on the target unlabeled data and then take it as a teacher for self-distillation. We then take the same initial pre-trained model as a student and enforce its hidden representations to be close to those of the teacher while optimizing the student with a masked auto-encoding objective. We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks and show that our proposed method outperforms all the relevant baselines. Theoretically, we analyze the proposed method with a simplified model to understand how self-distillation for further pre-training can potentially improve the performance of downstream tasks.
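The training objective described above combines a masked auto-encoding reconstruction loss with a term that keeps the student's hidden representations close to those of the frozen teacher. Below is a minimal PyTorch sketch of this combined loss, assuming a toy encoder: the `SimpleMAE` module, the masking ratio, the L2 distance on hidden states, and the weight `beta` are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of further pre-training with self-distillation as a regularizer.
# SimpleMAE, mask_ratio, beta, and the L2 penalty on hidden states are assumptions
# for illustration only, not the paper's actual architecture or hyperparameters.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleMAE(nn.Module):
    """Tiny masked auto-encoder over patch/token embeddings (stand-in for a ViT/BERT encoder)."""

    def __init__(self, dim=64):
        super().__init__()
        self.embed = nn.Linear(dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.Linear(dim, dim)          # reconstructs masked embeddings
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, x, mask):
        # mask: (B, N) boolean, True where a patch/token is masked out
        h = self.embed(x)
        h = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(h), h)
        hidden = self.encoder(h)                    # hidden representations used for distillation
        recon = self.decoder(hidden)                # masked auto-encoding reconstruction
        return hidden, recon


def further_pretrain_step(student, teacher, x, mask_ratio=0.6, beta=1.0):
    """One step of further pre-training regularized by self-distillation."""
    mask = torch.rand(x.shape[:2], device=x.device) < mask_ratio
    with torch.no_grad():                           # teacher is frozen (already further pre-trained)
        t_hidden, _ = teacher(x, mask)
    s_hidden, recon = student(x, mask)
    mae_loss = F.mse_loss(recon[mask], x[mask])     # reconstruct the masked positions
    distill_loss = F.mse_loss(s_hidden, t_hidden)   # keep student close to the teacher
    return mae_loss + beta * distill_loss


# Usage: the teacher is the initial pre-trained model after further pre-training on the
# target unlabeled data; the student is a fresh copy of the initial pre-trained model.
student = SimpleMAE()
teacher = copy.deepcopy(student)                    # placeholder for the further pre-trained teacher
x = torch.randn(8, 16, 64)                          # (batch, patches/tokens, embedding dim)
loss = further_pretrain_step(student, teacher, x)
loss.backward()
```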