Automated question generation is an important approach to enabling personalisation of English comprehension assessment. Recently, transformer-based pretrained language models have demonstrated the ability to produce appropriate questions from a context paragraph. Typically, these systems are evaluated against a reference set of manually generated questions using n-gram-based metrics, or by manual qualitative assessment. Here, we focus on a fully automated multiple-choice question generation (MCQG) system, where both the question and the possible answer options must be generated from the context paragraph. Applying n-gram-based approaches is challenging for this form of system, as the reference set is unlikely to capture the full range of possible questions and answer options. Conversely, manual assessment scales poorly and is expensive for MCQG system development. In this work, we propose a set of performance criteria that assess different aspects of interest of the generated multiple-choice questions: grammatical correctness, answerability, diversity and complexity. Initial systems for each of these criteria are described and individually evaluated on standard multiple-choice reading comprehension corpora.
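As a purely illustrative aside (not the paper's implementation), one of the proposed qualities, diversity, could be approximated over a batch of generated questions by a distinct n-gram ratio. The function and example questions below are hypothetical and serve only to make the idea concrete.

```python
# Minimal sketch: score lexical diversity of generated questions as the
# fraction of distinct n-grams. All names here are illustrative assumptions,
# not the metric defined in the paper.
from collections import Counter
from typing import List


def distinct_ngram_ratio(questions: List[str], n: int = 2) -> float:
    """Return the ratio of unique n-grams to total n-grams across questions."""
    ngrams = Counter()
    for q in questions:
        tokens = q.lower().split()
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
    total = sum(ngrams.values())
    return len(ngrams) / total if total else 0.0


if __name__ == "__main__":
    generated = [
        "What is the main idea of the passage?",
        "What does the author suggest about the main idea?",
        "Why did the character leave the village?",
    ]
    # Higher values indicate less repetition across the generated questions.
    print(f"distinct-2 ratio: {distinct_ngram_ratio(generated, n=2):.3f}")
```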