When writing source code, programmers have varying degrees of freedom in how they create and use identifiers. Do they habitually reuse the same identifiers, and do those names differ from the ones other programmers choose? If so, is it possible to tell who wrote a piece of code by examining its identifiers? Can the presence or absence of particular identifiers help to attribute programs to their authors correctly? Is it possible to hide the provenance of a program by renaming its identifiers? In this study, we assess the importance of three types of identifiers (class names, method names, and simple variable names) in source code author classification for two different Java program data sets. We do this through a sequence of experiments in which we disguise one type of identifier at a time. The experiments use the Source Code Author Profiles (SCAP) method as the classification tool. The results show that, although identifiers examined as a whole do not seem to reflect program authorship for these data sets, when the identifier types are examined separately there is evidence that class names do signal the author of the program. In contrast, the simple variable names and method names used in Java programs do not appear to reflect program authorship; indeed, our analysis suggests that such identifiers are so common that they mask authorship. We believe that these results are relevant to the robustness of code plagiarism analysis and that the underlying methods could be valuable in litigation arising from disputes over program authorship.
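To make the experimental idea concrete, the following Java sketch illustrates the two ingredients the abstract names: a profile-based comparison in the spirit of SCAP, and the disguising of one identifier type. It is a minimal sketch under stated assumptions, not the study's implementation. It assumes that a profile is the set of the most frequent n-grams in a program and that two profiles are compared by counting the n-grams they share; character n-grams stand in here for the byte-level n-grams usually associated with SCAP. The class `ScapSketch`, the method names, the placeholder `C`, the sample programs, and the parameter values are all hypothetical, and the regex-based renaming is a simplification that would also rewrite matches inside comments and string literals.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

final class ScapSketch {

    // Count every overlapping character n-gram in the text.
    static Map<String, Integer> ngramCounts(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= text.length(); i++) {
            counts.merge(text.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    // Profile = the (at most) `size` most frequent n-grams.
    // Ties are broken alphabetically so the result is deterministic.
    static Set<String> profile(String text, int n, int size) {
        Comparator<Map.Entry<String, Integer>> byCountDescThenKey = (a, b) -> {
            int c = Integer.compare(b.getValue(), a.getValue());
            return c != 0 ? c : a.getKey().compareTo(b.getKey());
        };
        return ngramCounts(text, n).entrySet().stream()
                .sorted(byCountDescThenKey)
                .limit(size)
                .map(Map.Entry::getKey)
                .collect(Collectors.toCollection(LinkedHashSet::new));
    }

    // Similarity = number of n-grams the two profiles have in common.
    static int sharedNgrams(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size();
    }

    // Hypothetical disguising step: replace each listed identifier,
    // matched as a whole word, with a neutral placeholder.
    static String disguise(String source, Set<String> names, String placeholder) {
        String result = source;
        for (String name : names) {
            result = result.replaceAll("\\b" + Pattern.quote(name) + "\\b",
                    Matcher.quoteReplacement(placeholder));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample programs, used only to exercise the methods.
        String known = "class InvoiceLedger { int total; void add(int x) { total += x; } }";
        String disputed = "class InvoiceLedger { int sum; void put(int v) { sum += v; } }";
        Set<String> classNames = new HashSet<>(Arrays.asList("InvoiceLedger"));

        int n = 4;      // n-gram length (assumed value)
        int size = 20;  // profile size (assumed value)
        int before = sharedNgrams(profile(known, n, size), profile(disputed, n, size));
        int after = sharedNgrams(
                profile(disguise(known, classNames, "C"), n, size),
                profile(disguise(disputed, classNames, "C"), n, size));
        System.out.println("shared n-grams before disguising: " + before);
        System.out.println("shared n-grams after disguising class names: " + after);
    }
}
```

In an experiment of the kind described, one would presumably repeat the disguising step for each identifier type in turn and compare classification accuracy with and without it. A drop in accuracy after disguising would suggest that the identifier type carries authorship information, while a rise would suggest that it masks authorship.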