Decompilation is the process of transforming binary programs into a high-level representation, such as source code, for human analysts to examine. While modern decompilers can recover much of the information discarded during compilation, inferring variable names remains extremely difficult. Inspired by recent advances in natural language processing, we propose a novel solution for inferring variable names in decompiled code based on Masked Language Modeling, Byte-Pair Encoding, and neural architectures such as Transformers and BERT. Our solution takes \textit{raw} decompiler output, code that carries little semantic meaning, as input, and enriches it using our proposed \textit{finetuning} technique, Constrained Masked Language Modeling. Constrained Masked Language Modeling introduces the challenge of predicting how many masked tokens make up the original variable name. We address this \textit{token count prediction} challenge with a post-processing algorithm. Compared to state-of-the-art approaches, our trained VarBERT model is simpler and performs substantially better. We evaluated our model on an existing large-scale data set of 164,632 binaries and showed that it can predict variable names identical to the ones present in the original source code up to 84.15\% of the time.
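For illustration, the following is a minimal sketch, not the authors' implementation, of how a masked-language-model formulation can be applied to decompiled code: a variable name in the raw decompiler output is replaced with \texttt{[MASK]} tokens and a BERT-style model predicts a token for each masked position. The pretrained checkpoint (\texttt{bert-base-uncased}) and the code snippet are placeholders, not the paper's VarBERT model or data.
\begin{verbatim}
# Minimal sketch of masked variable-name prediction over decompiled code.
# Assumes the HuggingFace `transformers` and `torch` packages; the checkpoint
# "bert-base-uncased" is a placeholder, not the paper's VarBERT weights.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# Raw decompiler output with the placeholder variable name (e.g. v1)
# replaced by [MASK]; the true name may span several subword tokens.
code = "int [MASK] = recv(sock, [MASK], size, 0);"
inputs = tokenizer(code, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# For each masked position, take the most likely token as the prediction.
mask_positions = (
    inputs["input_ids"] == tokenizer.mask_token_id
).nonzero(as_tuple=True)[1]
for pos in mask_positions:
    token_id = logits[0, pos].argmax().item()
    print(tokenizer.decode([token_id]))
\end{verbatim}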