Binary code similarity analysis (BCSA) is widely used in security applications such as plagiarism detection, software license violation detection, and vulnerability discovery. Despite the surging research interest in BCSA, conducting new research in this field remains challenging for several reasons. First, most existing approaches focus only on the end result, namely increasing the success rate of BCSA, by adopting uninterpretable machine learning. Moreover, they use their own benchmarks and share neither the source code nor the entire dataset. Finally, researchers often use different terminology, or even use the same technique without properly citing the previous literature, which makes it difficult to reproduce or extend previous work. To address these problems, we take a step back from the mainstream and contemplate fundamental research questions for BCSA: why does a certain technique or feature show better results than the others? Specifically, we conduct the first systematic study of the basic features used in BCSA by leveraging interpretable feature engineering on a large-scale benchmark. Our study reveals various useful insights into BCSA. For example, we show that a simple interpretable model with a few basic features can achieve results comparable to those of recent deep learning-based approaches. Furthermore, we show that the way binaries are compiled and the correctness of the underlying binary analysis tools can significantly affect the performance of BCSA. Lastly, we make all our source code and benchmark public and suggest future directions in this field to help further research.
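To make the idea of an interpretable, feature-based comparison concrete, the following is a minimal sketch and not the model evaluated in this study. It assumes each function is summarized by a handful of non-negative numeric features (the feature names and values below are hypothetical) and scores two functions by averaging one minus the relative difference of each shared feature. Because the per-feature scores are exposed, one can see which feature drives a match or a mismatch.

```python
from typing import Dict

# Hypothetical per-function numeric features (non-negative counts).
# The feature names are illustrative assumptions, not the paper's feature set.
Features = Dict[str, float]


def relative_difference(a: float, b: float) -> float:
    """Return |a - b| / max(|a|, |b|), or 0.0 when both values are zero."""
    denom = max(abs(a), abs(b))
    return 0.0 if denom == 0 else abs(a - b) / denom


def per_feature_scores(f1: Features, f2: Features) -> Dict[str, float]:
    """Score each shared feature as 1 - relative difference (1.0 means identical)."""
    shared = sorted(set(f1) & set(f2))
    return {k: 1.0 - relative_difference(f1[k], f2[k]) for k in shared}


def similarity(f1: Features, f2: Features) -> float:
    """Average the per-feature scores; 0.0 if the functions share no features."""
    scores = per_feature_scores(f1, f2)
    return sum(scores.values()) / len(scores) if scores else 0.0


# Made-up feature vectors: the same source function under two hypothetical
# build configurations, plus an unrelated function.
func_build_a = {"num_basic_blocks": 12, "num_call_edges": 4, "num_instructions": 100}
func_build_b = {"num_basic_blocks": 10, "num_call_edges": 4, "num_instructions": 80}
unrelated = {"num_basic_blocks": 3, "num_call_edges": 0, "num_instructions": 20}

same_score = similarity(func_build_a, func_build_b)
diff_score = similarity(func_build_a, unrelated)
print(f"same function, different build: {same_score:.3f}")
print(f"unrelated function:             {diff_score:.3f}")
print(per_feature_scores(func_build_a, func_build_b))
```

In this toy data the two builds of the same function score higher than the unrelated pair, and the per-feature breakdown shows that the instruction count contributes most of the difference. A real study would need feature selection and a threshold or ranking step on top of such a score.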