Developing models to automatically score students' written responses to science problems is critical for science education. However, collecting and labeling sufficient student responses to train such models is time-consuming and costly. Recent studies suggest that pre-trained language models (PLMs) can be adapted to downstream tasks via prompts, without fine-tuning. However, no prior research has employed such a prompt-based approach in science education. Because student responses are expressed in natural language, framing the scoring procedure as a next sentence prediction task using prompts can skip the costly fine-tuning stage. In this study, we developed a zero-shot approach to automatically score student responses via Matching Exemplars as Next Sentence Prediction (MeNSP), which requires no training samples. We first applied MeNSP to score three assessment tasks of scientific argumentation and found machine-human scoring agreements with Cohen's Kappa ranging from 0.30 to 0.57 and F1 scores ranging from 0.54 to 0.81. To improve performance, we extended our research to a few-shot setting, either randomly selecting labeled student responses or manually constructing responses to fine-tune the models. We found that performance on one task improved with more samples, with Cohen's Kappa rising from 0.30 to 0.38 and F1 score from 0.54 to 0.59; for the other two tasks, scoring performance did not improve. We also found that randomly selected few-shot samples performed better than the human expert-crafted ones. This study suggests that MeNSP can yield referable automatic scoring of student responses while significantly reducing the cost of model training, which can benefit low-stakes classroom assessment practices in science education. Future research should further explore the applicability of MeNSP to different types of assessment tasks in science education and improve model performance.
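To illustrate the core idea of scoring via next sentence prediction, the sketch below shows one plausible zero-shot setup: a student response is paired with a scored exemplar for each rubric level, and the level whose exemplar yields the highest "is next sentence" probability under a vanilla pre-trained BERT checkpoint is taken as the predicted score. This is a minimal sketch assuming the Hugging Face Transformers NSP head; the exemplar texts, rubric levels, and pairing order are illustrative placeholders, not the authors' released implementation.

```python
# Minimal sketch of exemplar-matching via next sentence prediction (NSP).
# Assumes a plain pre-trained BERT checkpoint; exemplars and the sample
# response below are hypothetical, not taken from the study's assessment tasks.
import torch
from transformers import BertTokenizer, BertForNextSentencePrediction

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
model.eval()

# Hypothetical scored exemplars: one representative response per rubric level.
exemplars = {
    0: "The claim is stated but no evidence from the data is given.",
    1: "The claim is supported with one piece of evidence from the data.",
    2: "The claim is supported with evidence, and reasoning links the two.",
}

student_response = "Plants grew taller with more light, so light affects growth."

def nsp_prob(sentence_a: str, sentence_b: str) -> float:
    """Probability that sentence_b follows sentence_a under BERT's NSP head."""
    inputs = tokenizer(sentence_a, sentence_b, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits  # index 0 = "is next sentence"
    return torch.softmax(logits, dim=-1)[0, 0].item()

# Predicted score = rubric level of the exemplar the response matches best,
# i.e., the pairing with the highest is-next-sentence probability.
scores = {level: nsp_prob(exemplar, student_response)
          for level, exemplar in exemplars.items()}
predicted_level = max(scores, key=scores.get)
print(predicted_level, scores)
```

Because the NSP head is already trained during BERT pre-training, this pairing requires no gradient updates; the few-shot variant described in the abstract would instead fine-tune the model on a handful of labeled or hand-crafted exemplar-response pairs.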