This paper surveys evaluation methods for natural language generation (NLG) systems that have been developed in the last few years. We group NLG evaluation methods into three categories: (1) human-centric evaluation metrics, (2) automatic metrics that require no training, and (3) machine-learned metrics. For each category, we discuss the progress that has been made and the challenges that remain, with a focus on the evaluation of recently proposed NLG tasks and neural NLG models. We then present two examples of task-specific NLG evaluation, one for automatic text summarization and one for long text generation, and conclude by proposing future research directions.