Multiple choice questions (MCQs) are an efficient and common way to assess reading comprehension (RC). Every MCQ needs a set of distractor answers that are incorrect but plausible enough to test student knowledge. Distractor generation (DG) models have been proposed, and their performance is typically evaluated using machine translation (MT) metrics. However, MT metrics often misjudge the suitability of generated distractors. We propose DISTO: the first learned evaluation metric for generated distractors. We validate DISTO by showing that its scores correlate highly with human ratings of distractor quality. At the same time, DISTO ranks the performance of state-of-the-art DG models very differently from MT-based metrics, showing that MT metrics should not be used for distractor evaluation.