Natural Language Generation (NLG) refers to the process of expressing a system's computed results in human language. Since the quality of sentences generated by an NLG model cannot be fully captured by quantitative metrics alone, they are also evaluated qualitatively by humans, who score the meaning or grammar of a sentence against subjective criteria. However, these existing evaluation methods suffer from large score deviations depending on each evaluator's criteria. In this paper, we propose Grammar Accuracy Evaluation (GAE), which provides specific evaluation criteria. By analyzing machine translation quality with both BLEU and GAE, we confirm that the BLEU score does not represent the absolute performance of machine translation models, and that GAE compensates for the shortcomings of BLEU through flexible evaluation of alternative synonyms and changes in sentence structure.