The source code of a program not only defines its semantics but also contains subtle clues that can identify its author. Several studies have shown that these clues can be automatically extracted using machine learning and allow for determining a program's author among hundreds of programmers. This attribution poses a significant threat to developers of anti-censorship and privacy-enhancing technologies, as they become identifiable and may be prosecuted. An ideal protection from this threat would be the anonymization of source code. However, neither theoretical nor practical principles of such an anonymization have been explored so far. In this paper, we tackle this problem and develop a framework for reasoning about code anonymization. We prove that the task of generating a $k$-anonymous program -- a program that cannot be attributed to one of $k$ authors -- is not computable and thus a dead end for research. As a remedy, we introduce a relaxed concept called $k$-uncertainty, which enables us to measure the protection of developers. Based on this concept, we empirically study candidate techniques for anonymization, such as code normalization, coding style imitation, and code obfuscation. We find that none of the techniques provides sufficient protection when the attacker is aware of the anonymization. While we introduce an approach for removing remaining clues from the code, the main result of our work is negative: Anonymization of source code is a hard and open problem.