There are many ways to express similar things in text, which makes evaluating natural language generation (NLG) systems difficult. Compounding this difficulty is the need to assess varying quality criteria depending on the deployment setting. While the landscape of NLG evaluation has been well-mapped, practitioners' goals, assumptions, and constraints -- which inform decisions about what, when, and how to evaluate -- are often partially or implicitly stated, or not stated at all. Combining a formative semi-structured interview study of NLG practitioners (N=18) with a survey study of a broader sample of practitioners (N=61), we surface goals, community practices, assumptions, and constraints that shape NLG evaluations, examining their implications and how they embody ethical considerations.