Evaluation practices in natural language generation (NLG) have many known flaws, but improved evaluation approaches are rarely widely adopted. This issue has become more urgent now that neural NLG models have improved to the point where their outputs can often no longer be distinguished based on the surface-level features that older metrics rely on. This paper surveys the issues with human and automatic model evaluations and with commonly used datasets in NLG that have been pointed out over the past 20 years. We summarize, categorize, and discuss how researchers have been addressing these issues and what their findings mean for the current state of model evaluations. Building on those insights, we lay out a long-term vision for NLG evaluation and propose concrete steps for researchers to improve their evaluation processes. Finally, we analyze 66 NLG papers from recent NLP conferences to assess how well they already follow these suggestions and to identify which areas require more drastic changes to the status quo.