Fast and reliable evaluation metrics are key to R&D progress. While traditional natural language generation (NLG) metrics are fast, they are not very reliable. Conversely, new metrics based on large pretrained language models are much more reliable, but require significant computational resources. In this paper, we propose FrugalScore, an approach to learn a fixed, low-cost version of any expensive NLG metric while retaining most of its original performance. Experiments with BERTScore and MoverScore on summarization and translation show that FrugalScore is on par with the original metrics (and sometimes better), while having several orders of magnitude fewer parameters and running several times faster. On average over all learned metrics, tasks, and variants, FrugalScore retains 96.8% of the performance, runs 24 times faster, and has 35 times fewer parameters than the original metrics. We make our trained metrics publicly available to benefit the entire NLP community, in particular researchers and practitioners with limited resources.
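As a rough illustration of the idea (not the authors' exact training pipeline), the sketch below distills an expensive metric into a cheap one: an expensive "teacher" metric such as BERTScore labels (candidate, reference) pairs once offline, and a miniature pretrained encoder is then fine-tuned to regress those scores, so that inference costs only one forward pass of the small model. The `pairs` data, the choice of miniature checkpoint, and all hyperparameters here are assumptions for illustration.

```python
# Minimal sketch of metric distillation (illustrative; not the exact
# FrugalScore recipe). A small pretrained encoder learns to regress
# the scores produced by an expensive metric (here, BERTScore).
import torch
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from bert_score import score as bert_score  # expensive "teacher" metric

# Hypothetical (candidate, reference) training pairs.
pairs = [
    ("the cat sat on the mat", "a cat was sitting on the mat"),
    ("sales rose sharply in 2020", "revenue increased strongly in 2020"),
]
cands, refs = map(list, zip(*pairs))

# 1) Label each pair with the expensive metric (done once, offline).
_, _, teacher_scores = bert_score(cands, refs, lang="en")  # F1 scores

# 2) Fine-tune a miniature model with a regression head on those scores.
name = "google/bert_uncased_L-4_H-256_A-4"  # a small BERT variant
tok = AutoTokenizer.from_pretrained(name)
student = AutoModelForSequenceClassification.from_pretrained(
    name, num_labels=1, problem_type="regression"
)
opt = AdamW(student.parameters(), lr=3e-5)

student.train()
for epoch in range(3):
    batch = tok(cands, refs, padding=True, return_tensors="pt")
    out = student(**batch, labels=teacher_scores.unsqueeze(-1))
    out.loss.backward()  # MSE between student predictions and teacher scores
    opt.step()
    opt.zero_grad()

# 3) At inference time only the cheap student runs: one forward pass
#    per pair instead of the expensive metric's full computation.
student.eval()
with torch.no_grad():
    batch = tok(cands, refs, padding=True, return_tensors="pt")
    cheap_scores = student(**batch).logits.squeeze(-1)
print(cheap_scores)
```

Because the student is fixed after training, its cost at evaluation time is independent of the teacher's size, which is where the parameter and speed savings reported above come from.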