Automated scoring of open-ended student responses has the potential to significantly reduce human grading effort. Recent advances in automated scoring often leverage textual representations from pre-trained language models such as BERT and GPT as input to scoring models. Most existing approaches train a separate model for each item/question, which is suitable for scenarios such as essay scoring where items can be quite different from one another. However, these approaches have two limitations: 1) they fail to leverage item linkage in scenarios such as reading comprehension, where multiple items may share a reading passage; 2) they are not scalable, since storing one model per item becomes difficult when models have a large number of parameters. In this paper, we report our (grand prize-winning) solution to the National Assessment of Educational Progress (NAEP) automated scoring challenge for reading comprehension. Our approach, in-context BERT fine-tuning, produces a single shared scoring model for all items, with a carefully designed input structure that provides contextual information on each item. We demonstrate the effectiveness of our approach via local evaluations using the training dataset provided by the challenge. We also discuss the biases, common error types, and limitations of our approach.
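To make the core idea concrete, the sketch below shows one plausible way a single shared BERT scorer could receive item-specific context alongside a student response as a sentence pair, so that the same parameters serve every item. This is a minimal illustration under our own assumptions, not the challenge submission itself; the function `score_response`, the prompt text, and the assumed number of score levels are all hypothetical.

```python
# Minimal sketch (assumed, not the authors' released code) of a shared
# scorer that takes item context + student response as one paired input.
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=4,  # assumed number of score levels for illustration
)

def score_response(item_prompt: str, response: str) -> int:
    """Encode the item's contextual information and the student response
    as a single sentence pair, so one model can score responses to any item."""
    inputs = tokenizer(
        item_prompt,      # in-context information identifying the item
        response,         # the open-ended student response
        truncation=True,
        max_length=512,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1).item())

# Hypothetical usage: the same model scores responses to different items
# simply by changing the item context in the first segment.
print(score_response(
    "Question: Why did the character leave home?",
    "Because she wanted to find her lost brother.",
))
```

Because the item context travels with each input rather than being baked into per-item weights, only one set of model parameters needs to be stored and deployed, which is the scalability benefit the abstract refers to.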