The ability to compute similarity scores of binary code at the function level is essential for cyber security. A single binary file can contain tens of thousands of functions. A deployable learning framework for cybersecurity applications needs to work not only accurately but also efficiently with large amounts of data. Traditional methods suffer from two drawbacks. First, it is very difficult to annotate different pairs of functions with accurate labels. These supervised learning methods can easily be overtrained with inaccurate labels. The second is that they either use the pre-trained encoder or use the fine-grained graph comparison. However, these methods have shortcomings in terms of time or memory consumption. We focus on large-scale Binary Code Similarity Detection (BCSD) and to mitigate the traditional problems, we propose GraphMoco: a graph momentum contrast model that uses multimodal structure information for large-scale binary function representation learning. We take an unsupervised learning approach and make full use of the structural information in the binary code. It does not require manually labelled similar or dissimilar information. Our models perform efficiently on large amounts of training data. Our experimental results show that our method outperforms the state-of-the-art in terms of accuracy.
翻译:暂无翻译