We show that large pre-trained language models are inherently highly capable of identifying label errors in natural language datasets: simply examining out-of-sample data points in descending order of fine-tuned task loss significantly outperforms more complex error-detection mechanisms proposed in previous work. To this end, we contribute a novel method for introducing realistic, human-originated label noise into existing crowdsourced datasets such as SNLI and TweetNLP. We show that this noise has similar properties to real, hand-verified label errors, and is harder to detect than existing synthetic noise, creating challenges for model robustness. We argue that human-originated noise is a better standard for evaluation than synthetic noise. Finally, we use crowdsourced verification to evaluate the detection of real errors on IMDB, Amazon Reviews, and Recon, and confirm that pre-trained models achieve an absolute Area Under the Precision-Recall Curve (AUPRC) 9-36% higher than existing models.
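To make the loss-ranking procedure concrete, the following is a minimal sketch, not the paper's code: it assumes out-of-sample class probabilities (`oos_probs`) for each example have already been obtained from a fine-tuned model (e.g., via cross-validation so that no example is scored by a model trained on it), and it ranks examples by the cross-entropy loss of their observed labels, flagging the highest-loss examples as likely label errors. All names here are illustrative.

```python
import numpy as np

def rank_by_task_loss(probs: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Rank examples by out-of-sample task loss, highest loss first.

    probs:  (n_examples, n_classes) predicted class probabilities from a
            fine-tuned model that did not train on the scored examples.
    labels: (n_examples,) observed (possibly noisy) integer labels.
    Returns example indices sorted so likely label errors come first.
    """
    eps = 1e-12  # guard against log(0)
    # Cross-entropy of each observed label under the model's prediction.
    losses = -np.log(probs[np.arange(len(labels)), labels] + eps)
    # Descending order of loss: highest-loss (most suspect) examples first.
    return np.argsort(-losses)

# Example usage: send the k most suspect examples for human verification.
# suspects = rank_by_task_loss(oos_probs, noisy_labels)[:k]
```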