Developing models to automatically score students' written responses to science problems is critical for science education. However, collecting and labeling sufficient student responses to train such models is time-consuming and costly. Recent studies suggest that pre-trained language models (PLMs) can be adapted to downstream tasks with prompts, without fine-tuning. However, no prior research has employed such a prompt-based approach in science education. Because student responses are written in natural language, framing the scoring procedure as a next sentence prediction task with prompts can skip the costly fine-tuning stage. In this study, we developed a zero-shot approach to automatically score student responses via Matching Exemplars as Next Sentence Prediction (MeNSP), which requires no training samples. We first applied MeNSP to score three assessment tasks of scientific argumentation and found machine-human scoring agreements with Cohen's Kappa ranging from 0.30 to 0.57 and F1 scores ranging from 0.54 to 0.81. To improve performance, we extended our research to a few-shot setting, fine-tuning the models with either randomly selected labeled student responses or manually constructed responses. We found that performance improved with more samples for one task (Cohen's Kappa from 0.30 to 0.38; F1 score from 0.54 to 0.59), whereas scoring performance did not improve for the other two. We also found that randomly selected few-shot examples performed better than those crafted by human experts. This study suggests that MeNSP can yield referable automatic scoring of student responses while significantly reducing the cost of model training. The method can benefit low-stakes classroom assessment practices in science education. Future research should further explore the applicability of MeNSP to different types of assessment tasks in science education and improve model performance.
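As a rough illustration of the idea behind MeNSP, a pre-trained BERT next sentence prediction (NSP) head can be asked, without any fine-tuning, how well a student response "follows" an exemplar response for each score level, and the best-matching level can be assigned as the score. The sketch below assumes a HuggingFace bert-base-uncased checkpoint; the exemplar wording, rubric levels, and argmax assignment rule are illustrative assumptions rather than the authors' exact configuration.

import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
model.eval()

# Hypothetical exemplar responses, one per score level of an argumentation rubric.
exemplars = {
    0: "The claim is stated but no evidence or reasoning is given.",
    1: "The claim is supported by evidence, but the reasoning is incomplete.",
    2: "The claim is supported by evidence and explained with scientific reasoning.",
}

student_response = "Plants grow taller with more light because light drives photosynthesis."

def nsp_match_prob(first_sentence: str, second_sentence: str) -> float:
    """Probability (under the NSP head) that second_sentence follows first_sentence."""
    inputs = tokenizer(first_sentence, second_sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    # In BERT's NSP head, index 0 corresponds to "is the next sentence".
    return torch.softmax(logits, dim=-1)[0, 0].item()

# Score the response against each exemplar and assign the best-matching level.
match_probs = {level: nsp_match_prob(text, student_response) for level, text in exemplars.items()}
predicted_score = max(match_probs, key=match_probs.get)
print(match_probs, "->", predicted_score)

Because the NSP head is used as-is, this zero-shot setup needs only exemplar responses per score level; the few-shot variants described above would additionally fine-tune the model on a handful of labeled or hand-constructed responses.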