Natural language generation (NLG) spans a broad range of tasks, each of which serves specific objectives and desires different properties of the generated text. This complexity makes automatic evaluation of NLG particularly challenging. Previous work has typically focused on a single task and developed individual evaluation metrics based on task-specific intuitions. In this paper, we propose a unifying perspective based on the nature of information change in NLG tasks, including compression (e.g., summarization), transduction (e.g., text rewriting), and creation (e.g., dialog). Information alignment between the input, context, and output text plays a common central role in characterizing the generation. Using automatic alignment prediction models, we develop a family of interpretable metrics suitable for evaluating key aspects of different NLG tasks, often without the need for gold reference data. Experiments show that the uniformly designed metrics achieve stronger or comparable correlations with human judgment than state-of-the-art metrics across diverse tasks, including text summarization, style transfer, and knowledge-grounded dialog.