Pre-training a large transformer model on a massive amount of unlabeled data and fine-tuning it on labeled datasets for diverse downstream tasks has proven to be a successful strategy for a variety of vision and natural language processing tasks. However, directly fine-tuning the pre-trained model may be suboptimal when there is a large discrepancy between the pre-training and fine-tuning data domains. To tackle this issue, several previous studies have proposed further pre-training strategies, in which the model is continually pre-trained on the target unlabeled dataset before fine-tuning. However, all of them focus solely on language models, and we empirically find that a Vision Transformer is vulnerable to overfitting when we continue to pre-train it on target unlabeled data. To address this limitation, we propose self-distillation as a regularization for the further pre-training stage. Specifically, we first further pre-train the initial pre-trained model on the target unlabeled data and then take it as a teacher for self-distillation. We then take the same initial pre-trained model as a student and enforce its hidden representations to be close to those of the teacher while optimizing the student with a masked auto-encoding objective. We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks and show that our proposed method outperforms all the relevant baselines. Theoretically, we analyze the proposed method with a simplified model to understand how self-distillation for further pre-training can potentially improve the performance of downstream tasks.
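The training objective described above combines a masked auto-encoding reconstruction loss with a term that keeps the student's hidden representations close to those of the frozen teacher. Below is a minimal PyTorch sketch of this combined loss, assuming a toy encoder: the `SimpleMAE` module, the masking ratio, the L2 distance on hidden states, and the weight `beta` are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of further pre-training with self-distillation as a regularizer.
# SimpleMAE, mask_ratio, beta, and the L2 penalty on hidden states are assumptions
# for illustration only, not the paper's actual architecture or hyperparameters.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleMAE(nn.Module):
    """Tiny masked auto-encoder over patch/token embeddings (stand-in for a ViT/BERT encoder)."""

    def __init__(self, dim=64):
        super().__init__()
        self.embed = nn.Linear(dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.decoder = nn.Linear(dim, dim)          # reconstructs masked embeddings
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, x, mask):
        # mask: (B, N) boolean, True where a patch/token is masked out
        h = self.embed(x)
        h = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(h), h)
        hidden = self.encoder(h)                    # hidden representations used for distillation
        recon = self.decoder(hidden)                # masked auto-encoding reconstruction
        return hidden, recon


def further_pretrain_step(student, teacher, x, mask_ratio=0.6, beta=1.0):
    """One step of further pre-training regularized by self-distillation."""
    mask = torch.rand(x.shape[:2], device=x.device) < mask_ratio
    with torch.no_grad():                           # teacher is frozen (already further pre-trained)
        t_hidden, _ = teacher(x, mask)
    s_hidden, recon = student(x, mask)
    mae_loss = F.mse_loss(recon[mask], x[mask])     # reconstruct the masked positions
    distill_loss = F.mse_loss(s_hidden, t_hidden)   # keep student close to the teacher
    return mae_loss + beta * distill_loss


# Usage: the teacher is the initial pre-trained model after further pre-training on the
# target unlabeled data; the student is a fresh copy of the initial pre-trained model.
student = SimpleMAE()
teacher = copy.deepcopy(student)                    # placeholder for the further pre-trained teacher
x = torch.randn(8, 16, 64)                          # (batch, patches/tokens, embedding dim)
loss = further_pretrain_step(student, teacher, x)
loss.backward()
```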