In this work, we review and evaluate a body of deep learning knowledge tracing (DLKT) models using openly available and widely used datasets, as well as a novel dataset of students learning to program. The evaluated DLKT models have been reimplemented to assess the reproducibility and replicability of previously reported results. We test different input and output layer variations found in the compared models that are independent of the models' main architectures, as well as different maximum attempt count options that have been used implicitly or explicitly in some studies. Several metrics are used to assess the quality of the evaluated knowledge tracing models. The evaluated knowledge tracing models include Vanilla-DKT, two Long Short-Term Memory Deep Knowledge Tracing (LSTM-DKT) variants, two Dynamic Key-Value Memory Network (DKVMN) variants, and Self-Attentive Knowledge Tracing (SAKT). We evaluate logistic regression, Bayesian Knowledge Tracing (BKT), and simple non-learning models as baselines. Our results suggest that, in general, the DLKT models outperform non-DLKT models, and that the relative differences between the DLKT models are subtle and often vary between datasets. Our results also show that naive models such as mean prediction can yield better performance than more sophisticated knowledge tracing models, especially in terms of accuracy. Further, our metric and hyperparameter analysis shows that the metric used to select the best model hyperparameters has a noticeable effect on the performance of the models, and that metric choice can affect model ranking. We also study the impact of input and output layer variations, filtering out long attempt sequences, and non-model properties such as randomness and hardware. Finally, we discuss model performance replicability and related issues. Our model implementations, evaluation code, and data are published as a part of this work.