Various evaluation metrics exist for natural language generation tasks, but their utility for story generation is limited: they generally correlate poorly with human judgments, and because they are designed to assess overall generation quality, they do not measure fine-grained story aspects such as fluency or relatedness. In this paper, we propose DeltaScore, an approach that uses perturbation to evaluate fine-grained story aspects. Our core hypothesis is that the better a story performs in a specific aspect (e.g., fluency), the more it will be affected by a corresponding perturbation (e.g., introducing typos). To measure this impact, we compute the likelihood difference between the pre- and post-perturbation story under a language model. We evaluate DeltaScore against state-of-the-art model-based and traditional similarity-based metrics across multiple story domains, and investigate its correlation with human judgments on five fine-grained story aspects: fluency, coherence, relatedness, logicality, and interestingness. Our results show that DeltaScore performs strongly in evaluating fine-grained story aspects, and, strikingly, a single perturbation proves highly effective at measuring most of them.
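The likelihood-difference computation at the heart of the method can be sketched as follows. This is a minimal illustration only, not the paper's implementation: it uses an add-one-smoothed unigram model as a stand-in for the actual pretrained language model, and a toy character-swap function as a stand-in for the typo perturbation.

```python
import math
from collections import Counter

def unigram_logprob(text, counts, total, vocab_size):
    # Add-one smoothed unigram log-likelihood (stand-in for a pretrained LM).
    return sum(math.log((counts[w] + 1) / (total + vocab_size))
               for w in text.split())

def perturb_typos(text):
    # Toy typo perturbation: swap the first two characters of each word
    # (illustrative only; the real perturbation would be more principled).
    def swap(word):
        return word[1] + word[0] + word[2:] if len(word) > 1 else word
    return " ".join(swap(w) for w in text.split())

def delta_score(story, corpus):
    # DeltaScore idea: likelihood difference between the pre- and
    # post-perturbation story under the language model.
    counts = Counter(" ".join(corpus).split())
    total = sum(counts.values())
    vocab = len(counts)
    pre = unigram_logprob(story, counts, total, vocab)
    post = unigram_logprob(perturb_typos(story), counts, total, vocab)
    return pre - post  # larger gap = story was more affected by the perturbation

# Tiny hypothetical corpus and story, for demonstration only.
corpus = ["the cat sat on the mat", "the dog ran in the park"]
print(delta_score("the cat ran in the park", corpus))
```

A fluent story drawn from in-distribution text loses more likelihood under the typo perturbation than an already-disfluent one, so a larger difference signals better performance on that aspect.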