The evaluation of abstractive summarization models typically uses test data that is drawn from the same distribution as the training data. In real-world practice, however, documents to be summarized may contain input noise caused by text extraction artifacts or data pipeline bugs. The robustness of model performance under the distribution shift caused by such noise is relatively under-studied. We present a large empirical study quantifying the sometimes severe loss in performance (up to 12 ROUGE-1 points) caused by different types of input noise across a range of datasets and model sizes. We then propose a lightweight method for detecting and removing such noise from the input during model inference, without requiring any extra training, auxiliary models, or even prior knowledge of the type of noise. Our proposed approach effectively mitigates the loss in performance, recovering a large fraction of the performance drop, sometimes as large as 11 ROUGE-1 points.
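The abstract describes inference-time detection and removal of input noise without extra training or auxiliary models. As a rough illustration of that setting only (not the paper's actual detector), the sketch below filters a document with a hypothetical character-level noise heuristic before it reaches the summarizer; `noise_score`, `filter_input`, and the 0.3 threshold are all assumptions made for this example.

```python
import re

def noise_score(sentence: str) -> float:
    """Crude, illustrative noise score: the fraction of characters that are
    neither letters, whitespace, nor common punctuation. Extraction artifacts
    (markup remnants, hex dumps, delimiter runs) tend to score higher than prose."""
    if not sentence:
        return 1.0
    clean = sum(ch.isalpha() or ch.isspace() or ch in ".,;:'\"!?-" for ch in sentence)
    return 1.0 - clean / len(sentence)

def filter_input(document: str, threshold: float = 0.3) -> str:
    """Split the document into sentence-like spans and drop those whose noise
    score exceeds the threshold, before passing the rest to the summarization
    model. No extra training or auxiliary model is involved; the scorer above
    is a stand-in heuristic, not the detector proposed in the paper."""
    spans = re.split(r"(?<=[.!?])\s+|\n+", document)
    kept = [s for s in spans if s.strip() and noise_score(s) <= threshold]
    return " ".join(kept)

if __name__ == "__main__":
    noisy_doc = (
        "The council approved the new budget on Tuesday.\n"
        "<div class=\"ad\">{{banner_370x90}} ||| 0x3F2A</div>\n"
        "Spending on public transit will rise by eight percent."
    )
    # Prints the two clean sentences; the markup-like span is dropped.
    print(filter_input(noisy_doc))
```

In practice the scoring function is the interesting part; the surrounding filter-then-summarize loop stays the same regardless of how noise spans are identified.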