We survey human evaluation in papers presenting work on creative natural language generation published at INLG 2020 and ICCC 2020. The most common human evaluation method is a scaled survey, typically on a 5-point scale, though many less common methods also exist. The most frequently evaluated parameters are meaning, syntactic correctness, novelty, relevance, and emotional value, among many others. Our guidelines for future evaluation include clearly defining the goal of the generative system, asking questions that are as concrete as possible, testing the evaluation setup, using multiple different evaluation setups, reporting the entire evaluation process and potential biases clearly, and finally analyzing the evaluation results more deeply than merely reporting the most typical statistics.