Compiler optimization level recognition can be applied to vulnerability discovery and binary analysis. Because many different compilation optimization options exist, the resulting differences in binary file contents are very complicated. With thousands of compiler optimization algorithms and multiple processor architectures, it is very difficult to analyze binary files manually and recognize their compiler optimization levels with hand-written rules. This paper is the first to propose a CNN-based compiler optimization level recognition model, BinEye. The system extracts semantic and structural differences and automatically recognizes compiler optimization levels. The model is designed to suit binary file processing and to be easy to understand. We built a dataset of 80,028 binary files for model training and testing. Our proposed model achieves an accuracy of over 97%. Moreover, because BinEye is fully CNN-based, its forward computation is at least 8 times faster than that of a typical RNN-based model. By analyzing the model output, we successfully identified the differences in assembly code caused by different compiler optimization levels, which means that the proposed model is interpretable. Based on our model, we propose a method for analyzing the code differences caused by different compiler optimization levels, which offers valuable guidance for analyzing closed-source compilers and for binary security analysis.
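To make the idea of a fully convolutional classifier over raw bytes concrete, the following is a minimal sketch and not BinEye's actual architecture, which the abstract does not specify. The label set `LEVELS`, the layer sizes, the random untrained weights, and the function names `init_params`, `forward`, and `predict` are all assumptions chosen for illustration. It embeds each byte, applies one 1D convolution with ReLU, takes a global max over positions, and maps the pooled features to class probabilities.

```python
import math
import random

# Hypothetical label set: the abstract does not list the levels BinEye predicts.
LEVELS = ["O0", "O1", "O2", "O3"]


def init_params(embed_dim=4, kernel=3, channels=8, seed=0):
    """Create small random (untrained) weights; all sizes are illustrative."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "embed": mat(256, embed_dim),                               # one vector per byte value
        "conv": [mat(kernel, embed_dim) for _ in range(channels)],  # channels x kernel x embed_dim
        "dense": mat(len(LEVELS), channels),                        # classes x channels
    }


def forward(params, data: bytes):
    """Return one probability per entry of LEVELS for a byte sequence."""
    x = [params["embed"][b] for b in data]  # sequence of embedding vectors
    k = len(params["conv"][0])              # kernel width
    if len(x) < k:
        raise ValueError("input is shorter than the convolution kernel")

    pooled = []
    for filt in params["conv"]:
        best = 0.0  # starting at 0 makes this the global max of ReLU outputs
        for i in range(len(x) - k + 1):
            # Each window is scored independently of every other window.
            s = sum(
                filt[j][d] * x[i + j][d]
                for j in range(k)
                for d in range(len(filt[j]))
            )
            best = max(best, s)
        pooled.append(best)

    logits = [sum(w * p for w, p in zip(row, pooled)) for row in params["dense"]]
    m = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(params, data: bytes) -> str:
    """Pick the most probable (hypothetical) optimization level label."""
    probs = forward(params, data)
    return LEVELS[probs.index(max(probs))]
```

Because every convolution window is computed independently, the positions can be evaluated in parallel, whereas a recurrent model must process the sequence step by step. That is the general reason a convolutional forward pass can be faster; the sketch itself makes no claim about the 8x figure reported above.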