Authorship Attribution of Source Code: A Language-Agnostic Approach and Applicability in Software Engineering

Authorship attribution (i.e., determining who is the author of a piece of source code) is an established research topic. State-of-the-art results for the authorship attribution problem look promising for the software engineering field, where they could be applied to detect plagiarized code and prevent legal issues. In this article, we first introduce a new language-agnostic approach to authorship attribution of source code. Then, we discuss the limitations of existing synthetic datasets for authorship attribution and propose a data collection approach that yields datasets that better reflect aspects important for potential practical use in software engineering. Finally, we demonstrate that the high accuracy of authorship attribution models on existing datasets drops drastically when the models are evaluated on more realistic data. We outline next steps for the design and evaluation of authorship attribution models that could bring the research efforts closer to practical use in software engineering.

