A wide variety of NLP applications, such as machine translation, summarization, and dialog, involve text generation. One major challenge for these applications is how to evaluate whether such generated texts are actually fluent, accurate, or effective. In this work, we conceptualize the evaluation of generated text as a text generation problem, modeled using pre-trained sequence-to-sequence models. The general idea is that models trained to convert the generated text to/from a reference output or the source text will achieve higher scores when the generated text is better. We operationalize this idea using BART, an encoder-decoder based pre-trained model, and propose a metric, BARTScore, with a number of variants that can be flexibly applied in an unsupervised fashion to the evaluation of text from different perspectives (e.g., informativeness, fluency, or factuality). BARTScore is conceptually simple and empirically effective. It outperforms existing top-scoring metrics in 16 of 22 test settings, covering evaluation of 16 datasets (e.g., machine translation, text summarization) and 7 different perspectives (e.g., informativeness, factuality). Code to calculate BARTScore is available at https://github.com/neulab/BARTScore, and we have released an interactive leaderboard for meta-evaluation at http://explainaboard.nlpedia.ai/leaderboard/task-meval/ on the ExplainaBoard platform, which allows us to interactively understand the strengths, weaknesses, and complementarity of each metric.
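The core scoring idea described above can be sketched as follows. This is a minimal illustration, not the released implementation: it assumes you already have per-token log-probabilities log p(y_t | y_<t, x) for the target text from a seq2seq model such as BART, and the function name and weighting scheme shown here are hypothetical.

```python
import math

def bart_score(token_logprobs, weights=None):
    """Score a generated text as the (optionally weighted) average
    log-probability its tokens receive under a seq2seq model when
    conditioned on the source or reference text.

    token_logprobs: per-token log p(y_t | y_<t, x) from the model
    weights: optional per-token weights (uniform by default)
    """
    if weights is None:
        weights = [1.0] * len(token_logprobs)
    total = sum(w * lp for w, lp in zip(weights, token_logprobs))
    # Length-normalize so texts of different lengths are comparable
    return total / len(token_logprobs)

# A hypothesis the model finds more probable gets a higher (less
# negative) score than one it finds less probable:
good = bart_score([math.log(0.9), math.log(0.8)])
poor = bart_score([math.log(0.2), math.log(0.1)])
```

Swapping what is used as the conditioning text and what is scored (source → hypothesis, hypothesis → reference, etc.) yields the different variants, each targeting a different evaluation perspective such as faithfulness or informativeness.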