Question generation (QGen) models are often evaluated with standardized NLG metrics based on n-gram overlap. In this paper, we measure whether improvements on these metrics translate into gains in a practical setting, focusing on the use case of helping teachers automate the generation of reading comprehension quizzes. In our study, teachers building a quiz receive question suggestions, which they can either accept or reject, giving a reason for each rejection. Although we find that recent progress in QGen leads to a significant increase in question acceptance rates, there is still large room for improvement: the best model had only 68.4% of its questions accepted by the ten teachers who participated in our study. We then leverage the collected annotations to analyze standard NLG metrics and find that model performance has reached their projected upper bounds, suggesting that new automatic metrics are needed to guide QGen research forward.
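To make concrete what "n-gram overlap" means in this context, the minimal sketch below scores a generated question against a human-written reference with BLEU via nltk. The example questions and the choice of BLEU with standard 4-gram weights are illustrative assumptions on our part, not items or settings drawn from the study.

```python
# A hedged sketch: scoring a generated question against a reference question
# with an n-gram-overlap metric (BLEU). Both questions are made-up placeholders.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "What did the main character find in the attic?".lower().split()
candidate = "What did the character discover in the attic?".lower().split()

# BLEU counts overlapping n-grams (here 1- to 4-grams, equally weighted) between
# the candidate and the reference; smoothing avoids zero scores when some
# higher-order n-grams do not match at all.
score = sentence_bleu(
    [reference],
    candidate,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {score:.3f}")
```

Because such a metric rewards only surface overlap with the reference, a question that is acceptable to a teacher but worded differently can still receive a low score, which is the gap the paper investigates.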