Automatic metrics are essential for developing natural language generation (NLG) models, particularly for open-ended language generation tasks such as story generation. However, existing automatic metrics are observed to correlate poorly with human evaluation. The lack of standardized benchmark datasets makes it difficult to fully evaluate the capabilities of a metric and fairly compare different metrics. Therefore, we propose OpenMEVA, a benchmark for evaluating open-ended story generation metrics. OpenMEVA provides a comprehensive test suite to assess the capabilities of metrics, including (a) the correlation with human judgments, (b) the generalization to different model outputs and datasets, (c) the ability to judge story coherence, and (d) the robustness to perturbations. To this end, OpenMEVA includes both manually annotated stories and auto-constructed test examples. We evaluate existing metrics on OpenMEVA and observe that they correlate poorly with human judgments, fail to recognize discourse-level incoherence, lack inferential knowledge (e.g., the causal order between events), and show limited generalization ability and robustness. Our study presents insights for developing NLG models and metrics in future research.
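As a minimal sketch (not the official OpenMEVA toolkit) of capability (a), correlation with human judgments is typically measured by scoring each generated story with the metric, pairing those scores with the corresponding human ratings, and computing Pearson and Spearman correlations; all function and variable names below are illustrative.

```python
# Illustrative sketch: correlating an automatic metric's scores with human
# ratings for the same set of generated stories. Names are hypothetical.
from scipy.stats import pearsonr, spearmanr

def correlate_with_humans(metric_scores, human_ratings):
    """Return (Pearson r, Spearman rho) between metric scores and human ratings."""
    pearson_r, _ = pearsonr(metric_scores, human_ratings)
    spearman_rho, _ = spearmanr(metric_scores, human_ratings)
    return pearson_r, spearman_rho

# Toy example: five stories, metric scores in [0, 1] vs. 1-5 human ratings.
metric_scores = [0.62, 0.71, 0.35, 0.80, 0.55]
human_ratings = [3, 4, 2, 5, 3]
print(correlate_with_humans(metric_scores, human_ratings))
```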