Most current state-of-the-art systems for generating English text from Abstract Meaning Representation (AMR) have been evaluated only using automated metrics, such as BLEU, which are known to be problematic for natural language generation. In this work, we present the results of a new human evaluation which collects fluency and adequacy scores, as well as categorization of error types, for several recent AMR generation systems. We discuss the relative quality of these systems and how our results compare to those of automatic metrics, finding that while the metrics are mostly successful in ranking systems overall, collecting human judgments allows for more nuanced comparisons. We also analyze common errors made by these systems.