Measuring plagiarism in programming assignments is an essential part of the educational process. This paper discusses methods of plagiarism and their detection in assignments written in C++ for introductory programming courses. A small corpus of assignments is made publicly available. We develop a general framework that computes the similarity between a pair of solutions, using three token-based similarity methods as features, and predicts whether a solution is plagiarized. The importance of each feature is also measured, which in turn ranks the effectiveness of each method. Finally, the artificially generated dataset improves the results compared to the original data. We achieve F1 scores of 0.955 and 0.971 on the original and synthetic datasets, respectively.
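To make the idea of token-based similarity features concrete, the following is a minimal Python sketch of how a pair of solutions might be turned into a three-element feature vector. The abstract does not name the three methods, so the choices here (Jaccard over token bigrams, cosine over token counts, and a `difflib` sequence ratio) are assumptions for illustration only, as are the crude regex tokenizer, the function names, and the 0.8 threshold that stands in for a trained classifier.

```python
import re
from collections import Counter
from difflib import SequenceMatcher
from math import sqrt

# Crude lexer (assumption): identifiers/keywords, integers, or any single symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(source):
    """Split source text into a flat list of tokens, dropping whitespace."""
    return TOKEN_RE.findall(source)


def bigrams(tokens):
    """Return the list of consecutive token pairs."""
    return [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]


def jaccard(a, b):
    """Set overlap: |A intersect B| / |A union B| (0.0 if both are empty)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def cosine(a, b):
    """Cosine similarity between token-count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def sequence_ratio(a, b):
    """Order-sensitive similarity of two token sequences via difflib."""
    return SequenceMatcher(None, a, b).ratio()


def pair_features(src_a, src_b):
    """Hypothetical feature vector for one solution pair."""
    ta, tb = tokenize(src_a), tokenize(src_b)
    return [jaccard(bigrams(ta), bigrams(tb)), cosine(ta, tb), sequence_ratio(ta, tb)]


def looks_plagiarized(src_a, src_b, threshold=0.8):
    """Stand-in for a trained classifier: flag pairs whose mean feature is high."""
    feats = pair_features(src_a, src_b)
    return sum(feats) / len(feats) >= threshold
```

In practice the feature vectors would be fed to a learned model rather than a fixed threshold, and the model's feature importances would then indicate which similarity method contributes most, which is the kind of ranking the abstract describes.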