The automatic generation of Multiple Choice Questions (MCQ) has the potential to significantly reduce the time educators spend on student assessment. However, existing evaluation metrics for MCQ generation, such as BLEU, ROUGE, and METEOR, focus on the n-gram based similarity of the generated MCQ to the gold sample in the dataset and disregard its educational value. They fail to evaluate the MCQ's ability to assess the student's knowledge of the corresponding target fact. To tackle this issue, we propose a novel automatic evaluation metric, coined Knowledge Dependent Answerability (KDA), which measures the MCQ's answerability given knowledge of the target fact. Specifically, we first show how to measure KDA based on student responses from a human survey. Then, we propose two automatic evaluation metrics, KDA_disc and KDA_cont, that approximate KDA by leveraging pre-trained language models to imitate students' problem-solving behavior. Through our human studies, we show that KDA_disc and KDA_cont have strong correlations with both (1) KDA and (2) usability in an actual classroom setting, labeled by experts. Furthermore, when combined with n-gram based similarity metrics, KDA_disc and KDA_cont are shown to have strong predictive power for various expert-labeled MCQ quality measures.
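To make the two automatic metrics concrete, below is a minimal illustrative sketch of how KDA_disc- and KDA_cont-style scores could be computed once a set of "simulated students" is available. It is not the paper's actual implementation: the `Student` callable, the `kda_scores` function, and the toy uniform student are hypothetical placeholders, and in practice each simulated student would be a pre-trained language model prompted with the target fact.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical interface: a "simulated student" maps a question, its answer
# options, and the target fact to a probability distribution over the options.
Student = Callable[[str, Sequence[str], str], List[float]]


def kda_scores(
    question: str,
    options: Sequence[str],
    answer_idx: int,
    target_fact: str,
    students: Sequence[Student],
) -> Dict[str, float]:
    """Illustrative KDA_disc / KDA_cont style scores (assumed formulation).

    KDA_disc: fraction of simulated students whose top choice is the gold
              answer when they are given the target fact.
    KDA_cont: mean probability mass the students place on the gold answer.
    """
    correct, prob_mass = 0, 0.0
    for student in students:
        probs = student(question, options, target_fact)
        top = max(range(len(options)), key=lambda i: probs[i])
        correct += int(top == answer_idx)
        prob_mass += probs[answer_idx]
    n = len(students)
    return {"KDA_disc": correct / n, "KDA_cont": prob_mass / n}


if __name__ == "__main__":
    # Toy stand-in for a pre-trained language model: guesses uniformly.
    def uniform_student(q: str, opts: Sequence[str], fact: str) -> List[float]:
        return [1.0 / len(opts)] * len(opts)

    print(kda_scores(
        question="Which gas do plants absorb during photosynthesis?",
        options=["Carbon dioxide", "Oxygen", "Nitrogen", "Helium"],
        answer_idx=0,
        target_fact="Plants take in carbon dioxide during photosynthesis.",
        students=[uniform_student],
    ))
```

In this reading, KDA_disc discretizes each simulated student's behavior into correct/incorrect before averaging, while KDA_cont keeps the continuous probability assigned to the gold answer; both are only high when the question is answerable from knowledge of the target fact.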