Knowledge Distillation (KD) is a commonly used technique for improving the generalization of compact Pre-trained Language Models (PLMs) on downstream tasks. However, such methods impose the additional burden of training a separate teacher model for every new dataset. Alternatively, one may directly improve the optimization procedure of the compact model toward better generalization. Recent work observes that the flatness of the local minimum correlates well with better generalization. In this work, we adapt Stochastic Weight Averaging (SWA), a method that encourages convergence to a flatter minimum, to the fine-tuning of PLMs. We conduct extensive experiments on various NLP tasks (text classification, question answering, and generation) and different model architectures, and demonstrate that our adaptation improves generalization without extra computational cost. Moreover, we observe that this simple optimization technique is able to outperform state-of-the-art KD methods for compact models.
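Below is a minimal sketch of how SWA can be folded into a standard fine-tuning loop, assuming PyTorch's `torch.optim.swa_utils` and a HuggingFace-style model whose forward pass returns a loss. The hyperparameters (`swa_start`, `swa_lr`, learning rate) are illustrative placeholders, not the paper's exact recipe.

```python
# Sketch: fine-tuning a compact PLM with Stochastic Weight Averaging (SWA).
# Assumes a model whose forward pass returns an object with a .loss field
# (e.g., a HuggingFace-style model) and a DataLoader yielding dicts of tensors.
import torch
from torch.optim.swa_utils import AveragedModel, SWALR


def fine_tune_with_swa(model, train_loader, num_epochs=3, swa_start=1,
                       lr=2e-5, swa_lr=1e-5, device="cuda"):
    model.to(device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    swa_model = AveragedModel(model)                 # keeps a running average of the weights
    swa_scheduler = SWALR(optimizer, swa_lr=swa_lr)  # anneals to a constant SWA learning rate

    for epoch in range(num_epochs):
        model.train()
        for batch in train_loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
        if epoch >= swa_start:
            # Fold the current iterate into the weight average once SWA kicks in.
            swa_model.update_parameters(model)
            swa_scheduler.step()

    # Transformer encoders use LayerNorm rather than BatchNorm, so the usual
    # update_bn() pass over the training data is unnecessary here.
    return swa_model
```

At evaluation time, the averaged weights held by `swa_model` are used in place of the final SGD iterate, which is what encourages convergence to a flatter region of the loss surface.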