One of the most popular paradigms for applying large pre-trained NLP models such as BERT is to fine-tune them on smaller downstream datasets. However, one challenge remains: the fine-tuned model often overfits on these smaller datasets. A symptom of this phenomenon is that irrelevant or misleading words in a sentence, which are easy for humans to recognize, can substantially degrade the performance of the fine-tuned BERT model. In this paper, we propose a novel technique, called Self-Supervised Attention (SSA), to address this generalization challenge. Specifically, SSA iteratively generates weak, token-level attention labels by probing the model fine-tuned in the previous iteration. We investigate two different ways of integrating SSA into BERT and propose a hybrid approach that combines their benefits. Empirically, on a variety of public datasets, we demonstrate significant performance improvements with our SSA-enhanced BERT model.
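To make the probing step concrete, the following is a minimal sketch of how weak, token-level attention labels could be generated from a fine-tuned classifier. It assumes a HuggingFace-style model, masking as the token-perturbation strategy, and a confidence-drop threshold; these specifics are illustrative assumptions, not the paper's exact procedure.

```python
# Illustrative sketch: generate weak token-level attention labels by probing
# a fine-tuned classifier. A token gets label 1 ("informative") if masking it
# noticeably lowers the model's confidence in its own prediction, else 0.
# The threshold and masking strategy are assumptions for this sketch.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

def weak_attention_labels(model, tokenizer, sentence, threshold=0.05):
    model.eval()
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**enc).logits, dim=-1)
    pred = probs.argmax(dim=-1).item()          # model's own prediction
    base_conf = probs[0, pred].item()           # confidence on the full sentence

    labels = []
    for i in range(enc["input_ids"].shape[1]):
        masked = {k: v.clone() for k, v in enc.items()}
        masked["input_ids"][0, i] = tokenizer.mask_token_id   # perturb one token
        with torch.no_grad():
            conf = torch.softmax(model(**masked).logits, dim=-1)[0, pred].item()
        labels.append(1 if base_conf - conf > threshold else 0)
    return labels

# Example usage with a (hypothetical) fine-tuned sentiment classifier:
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
clf = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
print(weak_attention_labels(clf, tok, "The movie was, honestly, surprisingly good."))
```

In the iterative scheme described above, such weak labels would supervise an auxiliary attention objective in the next fine-tuning round, and the probe-then-fine-tune loop repeats.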