Near-Negative Distinction: Giving a Second Life to Human Evaluation Datasets

Precisely assessing progress in natural language generation (NLG) tasks is challenging, and human evaluation is often necessary to establish a preference for one model's output over another's. However, human evaluation is usually costly, difficult to reproduce, and non-reusable. In this paper, we propose a new and simple automatic evaluation method for NLG called Near-Negative Distinction (NND) that repurposes prior human annotations into NND tests. In an NND test, an NLG model must place higher likelihood on a high-quality output candidate than on a near-negative candidate with a known error. Model performance is established by the number of NND tests a model passes, as well as by the distribution over task-specific errors the model fails on. Through experiments on three NLG tasks (question generation, question answering, and summarization), we show that NND achieves higher correlation with human judgments than standard NLG evaluation metrics. We then illustrate NND evaluation in four practical scenarios, for example performing fine-grained model analysis or studying model training dynamics. Our findings suggest that NND can give a second life to human annotations and provide low-cost NLG evaluation.
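The scoring rule described in the abstract can be illustrated with a minimal sketch. It assumes the model's log-likelihoods for each candidate pair have already been computed; the names `NNDTest` and `evaluate_nnd`, the error labels, and the numbers are hypothetical and not taken from the paper. Details such as length normalization or the handling of ties are not specified in the abstract, so a tie is simply counted as a failure here.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class NNDTest:
    """One test: a high-quality candidate paired with a near-negative one."""
    good_loglik: float      # model log-likelihood of the high-quality candidate
    negative_loglik: float  # model log-likelihood of the near-negative candidate
    error_type: str         # annotated error of the near-negative (hypothetical labels)


def evaluate_nnd(tests):
    """Return (pass_rate, failures_by_error_type) for a list of NNDTest items."""
    failures = Counter()
    passed = 0
    for t in tests:
        # A test passes only if the good candidate is strictly more likely.
        if t.good_loglik > t.negative_loglik:
            passed += 1
        else:
            failures[t.error_type] += 1
    pass_rate = passed / len(tests) if tests else 0.0
    return pass_rate, dict(failures)


# Illustrative, made-up log-likelihoods for four tests.
tests = [
    NNDTest(-12.3, -15.8, "disfluent"),
    NNDTest(-20.1, -18.4, "off-target"),
    NNDTest(-9.7, -11.2, "wrong-context"),
    NNDTest(-14.0, -13.5, "off-target"),
]
pass_rate, failures = evaluate_nnd(tests)
print(pass_rate)  # 0.5
print(failures)   # {'off-target': 2}
```

The pass rate gives a single comparable score per model, while the per-error failure counts correspond to the distribution over task-specific errors that the abstract mentions.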

