Clustering Introductory Computer Science Exercises Using Topic Modeling Methods

Manually determining the concepts present in a group of questions is a challenging and time-consuming process. However, it is an essential step when modeling a virtual learning environment, since mastery-level assessment and recommendation engines require a mapping between concepts and questions. We investigated unsupervised semantic models (known as topic modeling techniques) to assist computer science teachers in this task, and we propose a method to transform teacher-provided code solutions from Computer Science 1 into representative text documents that include code structure information. By applying non-negative matrix factorization and latent Dirichlet allocation, we extract the underlying relationships between questions and validate the results using an external dataset. We assess the interpretability of the learned concepts using data from 14 university professors, and the results confirm six semantically coherent clusters in the current dataset. Moreover, the six topics comprise the main concepts present in the test dataset, achieving a normalized pointwise mutual information (NPMI) score of 0.75. Because this metric correlates with human ratings, the proposed method is useful for providing semantics for large amounts of unannotated code.
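To make the two central ideas concrete (turning a code solution into a text document that keeps structure information, and scoring word pairs with normalized pointwise mutual information), a minimal sketch follows. It is not the transformation or evaluation pipeline used in this work: it assumes the solutions are written in Python, uses AST node type names plus `call_<name>` tokens as a hypothetical vocabulary, and estimates probabilities from document-level co-occurrence. The helper names `code_to_document`, `npmi`, and `topic_coherence` are illustrative only. In a full pipeline, documents like these would be vectorized and passed to a non-negative matrix factorization or latent Dirichlet allocation implementation, which is omitted here.

```python
import ast
import math
from itertools import combinations


def code_to_document(source: str) -> str:
    """Turn a Python solution into a whitespace-separated token document.

    Tokens are lowercase AST node type names (structure information) plus
    "call_<name>" for calls to plainly named functions. This token scheme
    is an assumption made for illustration.
    """
    tokens = []
    for node in ast.walk(ast.parse(source)):
        tokens.append(type(node).__name__.lower())
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            tokens.append("call_" + node.func.id)
    return " ".join(tokens)


def npmi(word_a: str, word_b: str, documents: list) -> float:
    """NPMI of two words, estimated from document-level co-occurrence."""
    n = len(documents)
    doc_sets = [set(doc.split()) for doc in documents]
    p_a = sum(word_a in s for s in doc_sets) / n
    p_b = sum(word_b in s for s in doc_sets) / n
    p_ab = sum(word_a in s and word_b in s for s in doc_sets) / n
    if p_ab == 0:
        return -1.0  # the words never co-occur
    if p_ab == 1:
        return 1.0  # convention chosen here: both words are in every document
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


def topic_coherence(top_words: list, documents: list) -> float:
    """Mean NPMI over all pairs of a topic's top words (needs >= 2 words)."""
    pairs = list(combinations(top_words, 2))
    return sum(npmi(a, b, documents) for a, b in pairs) / len(pairs)


# Three toy solutions standing in for teacher-provided code.
solutions = [
    "for i in range(3):\n    print(i)\n",
    "total = 0\nfor x in [1, 2]:\n    total += x\nprint(total)\n",
    "n = int(input())\nif n > 0:\n    print(n)\n",
]
documents = [code_to_document(s) for s in solutions]
loop_score = npmi("for", "augassign", documents)
```

Here `topic_coherence(["for", "augassign", "call_range"], documents)` would score a hypothetical "loops" topic. Scores near 1 mean the top words tend to appear in the same solutions, and scores near -1 mean they rarely or never do.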
