The ability to identify the authors of computer programs from their coding style is a direct threat to the privacy and anonymity of programmers. While recent work found that source code can be attributed to authors with high accuracy, attribution of executable binaries appears to be much more difficult. Many distinguishing features present in source code, e.g., variable names, are removed during compilation, and compiler optimization may alter the structure of a program, further obscuring features that are known to be useful in determining authorship. We examine programmer de-anonymization from the standpoint of machine learning, using a novel set of features that includes ones obtained by decompiling the executable binary to source code. We adapt a powerful set of techniques from the domain of source code authorship attribution and combine them with stylistic representations embedded in assembly, resulting in successful de-anonymization of a large set of programmers. We evaluate our approach on data from the Google Code Jam, obtaining attribution accuracy of up to 96% with 100 candidate programmers and 83% with 600. We present, for the first time, an executable binary authorship attribution approach that is robust to basic obfuscations, a range of compiler optimization settings, and binaries that have been stripped of their symbol tables. We perform programmer de-anonymization using both obfuscated binaries and real-world code found "in the wild" in single-author GitHub repositories and the recently leaked Nulled.IO hacker forum. We show that programmers who would like to remain anonymous need to take extreme countermeasures to protect their privacy.