Precisely assessing progress in natural language generation (NLG) tasks is challenging, and human evaluation to establish preference for one model's output over another's is often necessary. However, human evaluation is usually costly, difficult to reproduce, and non-reusable. In this paper, we propose a new and simple automatic evaluation method for NLG called Near-Negative Distinction (NND) that repurposes prior human annotations into NND tests. In an NND test, an NLG model must place higher likelihood on a high-quality output candidate than on a near-negative candidate with a known error. Model performance is measured by the number of NND tests a model passes, as well as the distribution of task-specific errors it fails on. Through experiments on three NLG tasks (question generation, question answering, and summarization), we show that NND achieves higher correlation with human judgments than standard NLG evaluation metrics. We then illustrate NND evaluation in four practical scenarios, such as performing fine-grained model analysis or studying model training dynamics. Our findings suggest NND can give a second life to human annotations and provide low-cost NLG evaluation.
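To make the test procedure concrete, the following is a minimal sketch of how NND scoring could be implemented: a model passes a test if it assigns higher likelihood to the high-quality candidate than to the near-negative one, and failures are tallied by annotated error type. The `NNDTest` container, `evaluate_nnd` function, and model-specific `log_likelihood(input, candidate)` scorer are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of NND-style scoring (hypothetical names, not the paper's released code).
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class NNDTest:
    """One NND test: an input paired with a high-quality candidate and a
    near-negative candidate whose error type is known from prior human annotation."""
    input_text: str
    positive: str        # high-quality output candidate
    near_negative: str   # candidate containing a known error
    error_type: str      # e.g. "disfluency" or "hallucination" (task-specific label)

def evaluate_nnd(tests: List[NNDTest],
                 log_likelihood: Callable[[str, str], float]):
    """Pass a test when the model assigns higher likelihood to the high-quality
    candidate than to the near-negative candidate; count failures per error type."""
    passed = 0
    failures_by_error = Counter()
    for t in tests:
        if log_likelihood(t.input_text, t.positive) > log_likelihood(t.input_text, t.near_negative):
            passed += 1
        else:
            failures_by_error[t.error_type] += 1
    pass_rate = passed / len(tests) if tests else 0.0
    return pass_rate, failures_by_error
```

Any conditional sequence scorer (e.g., the sum of token log-probabilities under an NLG model) could serve as the `log_likelihood` argument in this sketch.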