We investigate the effectiveness and reliability of an artificial intelligence (AI)-based grading system for a handwritten general chemistry exam by comparing AI-assigned scores with human grading across question types. Exam pages and grading rubrics were uploaded as images to accommodate chemical reaction equations, short and long open-ended answers, numerical and symbolic derivations, and drawings and sketches in pencil-and-paper format. Linear regression analyses and psychometric evaluations reveal high agreement between AI and human graders for textual and chemical-reaction questions, but lower reliability for numerical and graphical tasks. The findings underscore the need for human oversight, applied through selective filtering of AI-assigned scores, to ensure grading accuracy. The results indicate promising applications for AI in routine assessment tasks, though careful consideration must be given to students' perceptions of fairness and trust when integrating AI-based grading into educational practice.