Language model fine-tuning is essential for modern natural language processing, but it is computationally expensive and time-consuming. Further, the effectiveness of fine-tuning is limited by the inclusion of training examples that negatively affect performance. Here we present a general fine-tuning method, which we call information gain filtration, for improving both the training efficiency and the final performance of language model fine-tuning. We define the information gain of an example as the improvement on a test metric after training on that example. A secondary learner is then trained to approximate this quantity. During fine-tuning, this learner selects informative examples and skips uninformative ones. We show that our method yields consistent improvements across datasets, fine-tuning tasks, and language model architectures. For example, we achieve a median perplexity of 54.0 on a books dataset, compared to 57.3 for standard fine-tuning. We present statistical evidence that offers insight into why our method improves over standard fine-tuning. The generality of our method leads us to propose a new paradigm for language model fine-tuning: we encourage researchers to release pretrained secondary learners on common corpora to promote efficient and effective fine-tuning, thereby improving the performance and reducing the overall energy footprint of language model fine-tuning.
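The procedure described above can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' implementation: the feature representation, the secondary learner (here a toy nearest-neighbour estimator), and the threshold are all illustrative assumptions.

```python
def information_gain(metric_before, metric_after):
    """Information gain of an example: the improvement on a held-out
    test metric (e.g. a drop in perplexity) after training on it."""
    return metric_before - metric_after


class SecondaryLearner:
    """Toy learner that approximates information gain from a scalar
    feature of each example via nearest-neighbour lookup.
    (Illustrative stand-in for the paper's secondary learner.)"""

    def __init__(self):
        self.observed = []  # list of (feature, measured_gain) pairs

    def observe(self, feature, gain):
        # Record a measured information gain for a training example.
        self.observed.append((feature, gain))

    def predict(self, feature):
        # Estimate gain from the closest previously observed example.
        if not self.observed:
            return 0.0
        _, gain = min(self.observed, key=lambda fg: abs(fg[0] - feature))
        return gain


def filter_examples(examples, learner, threshold=0.0):
    """During fine-tuning, keep only examples whose predicted
    information gain exceeds the threshold; skip the rest."""
    return [ex for ex in examples if learner.predict(ex["feature"]) > threshold]
```

In use, the secondary learner would first be fit on (example, measured gain) pairs collected from pilot fine-tuning runs; thereafter, `filter_examples` decides online which examples are worth a gradient step.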