Measuring the performance of natural language processing models is challenging. Traditionally used metrics, such as BLEU and ROUGE, originally devised for machine translation and summarization, have been shown to suffer from low correlation with human judgment and a lack of transferability to other tasks and languages. In the past 15 years, a wide range of alternative metrics have been proposed. However, it is unclear to what extent this has had an impact on NLP benchmarking efforts. Here we provide the first large-scale cross-sectional analysis of metrics used for measuring performance in natural language processing. We curated, mapped, and systematized more than 3,500 machine learning model performance results from the open repository 'Papers with Code' to enable a global and comprehensive analysis. Our results suggest that the large majority of natural language processing metrics currently used have properties that may result in an inadequate reflection of a model's performance. Furthermore, we found that ambiguities and inconsistencies in the reporting of metrics may lead to difficulties in interpreting and comparing model performances, impairing transparency and reproducibility in NLP research.