Collecting human judgements is currently the most reliable evaluation method for natural language generation systems. Automatic metrics have known flaws when used to measure quality aspects of generated text and have been shown to correlate poorly with human judgements. However, human evaluation is time- and cost-intensive, and there is little consensus on how to design and conduct human evaluation experiments. There is therefore a need for streamlined approaches to collect human judgements efficiently when evaluating natural language generation systems. We present a dynamic approach that measures the number of human annotations required when evaluating generated outputs in relative comparison settings. We propose an agent-based framework of human evaluation to assess multiple labelling strategies and decision methods for determining the better model, in both a simulation and a crowdsourcing case study. The main results indicate that a decision about the superior model can be made with high probability across the different labelling strategies, and that assigning a single random worker per task requires the least overall labelling effort and thus the lowest cost.
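To illustrate the kind of dynamic stopping behaviour such a framework can simulate, the sketch below models a single-random-worker-per-task labelling strategy: pairwise judgements are requested one at a time and collection stops as soon as a sequential test favours one model. This is a minimal, self-contained illustration, not the paper's actual method; the Wald sequential probability ratio test, the worker accuracy, and the preference rate are all assumed parameters chosen for the example.

```python
import math
import random


def sequential_comparison(p_prefer_a=0.6, worker_accuracy=0.8,
                          alpha=0.05, beta=0.05, delta=0.05,
                          max_labels=2000, seed=0):
    """Toy dynamic evaluation: request one pairwise judgement at a time
    and stop once a Wald SPRT decides which model is preferred.
    Returns the decision and the number of judgements that were needed.
    All parameter values are illustrative assumptions."""
    rng = random.Random(seed)
    p0, p1 = 0.5 - delta, 0.5 + delta        # H0: model B preferred, H1: model A preferred
    upper = math.log((1 - beta) / alpha)     # accept H1 when the log-likelihood ratio exceeds this
    lower = math.log(beta / (1 - alpha))     # accept H0 when it falls below this
    llr = 0.0
    for n in range(1, max_labels + 1):
        # ground truth for this item: is model A's output genuinely better?
        a_better = rng.random() < p_prefer_a
        # a single random worker judges it, erring with probability 1 - worker_accuracy
        vote_a = a_better if rng.random() < worker_accuracy else not a_better
        llr += math.log(p1 / p0) if vote_a else math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "A", n
        if llr <= lower:
            return "B", n
    return "undecided", max_labels


if __name__ == "__main__":
    decision, labels = sequential_comparison()
    print(f"decided for model {decision} after {labels} judgements")
```

Running the simulation repeatedly with different seeds gives an estimate of how many judgements a given labelling strategy needs before a confident decision, which is the kind of question the agent-based framework is designed to answer.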