Current pre-trained language models rely on large datasets to achieve state-of-the-art performance. However, past research has shown that not all examples in a dataset are equally important during training; in fact, a considerable fraction of the training set can sometimes be pruned without hurting test performance. Two gradient-based scoring metrics for identifying important examples, established on standard vision benchmarks, are GraNd and its estimated variant, EL2N. In this work, we employ these two metrics in NLP for the first time. We demonstrate that these metrics must be computed after at least one epoch of fine-tuning, as they are unreliable in the early training steps. Furthermore, we show that by pruning a small portion of the examples with the highest GraNd/EL2N scores, we can not only preserve test accuracy but also surpass it. This paper details the adjustments and implementation choices that enable GraNd and EL2N to be applied to NLP.
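For concreteness, below is a minimal sketch of how EL2N scores might be computed after fine-tuning. EL2N is the L2 norm of the error vector between the model's softmax output and the one-hot label, which serves as an estimate of GraNd (the expected per-example gradient norm). The sketch assumes a classifier whose forward pass returns raw logits and a dataloader yielding `(inputs, labels)` batches; these interfaces, and the 10% pruning fraction at the end, are illustrative assumptions rather than the paper's exact setup.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def el2n_scores(model, dataloader, device="cpu"):
    """EL2N score per example: L2 norm of the error vector
    (softmax probabilities minus the one-hot label).
    EL2N estimates GraNd, the expected per-example gradient norm."""
    model.eval()
    model.to(device)
    scores = []
    for inputs, labels in dataloader:
        inputs, labels = inputs.to(device), labels.to(device)
        logits = model(inputs)                        # [batch, num_classes]
        probs = F.softmax(logits, dim=-1)
        onehot = F.one_hot(labels, probs.size(-1)).float()
        scores.append((probs - onehot).norm(dim=-1))  # [batch]
    return torch.cat(scores)

# Illustrative usage: score after at least one epoch of fine-tuning,
# then drop the highest-scored examples (the 10% fraction is hypothetical).
# scores = el2n_scores(model, train_loader)
# keep_idx = scores.argsort()[: int(0.9 * len(scores))]
```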