Reverse engineering binaries is required to understand and analyse programs for which the source code is unavailable. Decompilers can transform largely unreadable binaries into a more readable, source-code-like representation. However, reverse engineering is time-consuming, and much of that time is spent labelling functions with semantic information. While the automated summarisation of decompiled code can help reverse engineers understand and analyse binaries, current work mainly focuses on summarising source code, and no suitable dataset exists for this task. In this work, we extend large pre-trained language models of source code to summarise decompiled binary functions. Furthermore, we investigate the impact of input and data properties on the performance of such models. Our approach consists of two main components: the data and the model. We first build CAPYBARA, a dataset of 214K decompiled function-documentation pairs across various compiler optimisations. We extend CAPYBARA further by generating synthetic datasets and deduplicating the data. Next, we fine-tune the CodeT5 base model on CAPYBARA to create BinT5. BinT5 achieves state-of-the-art BLEU-4 scores of 60.83, 58.82, and 44.21 for summarising source, decompiled, and synthetically stripped decompiled code, respectively. This indicates that these models can be extended to decompiled binaries successfully. Finally, we find that the performance of BinT5 is not heavily dependent on the dataset size or the compiler optimisation level. We recommend that future research further investigate transferring knowledge when working with less expressive input formats such as stripped binaries.
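Since the results above are reported as BLEU-4 scores, a minimal sketch of a sentence-level BLEU-4 computation may help in interpreting them. The abstract does not state which BLEU variant, tokenisation, or smoothing is used, so the whitespace tokenisation, the add-one smoothing, and the function names `ngrams` and `bleu4` below are assumptions made purely for illustration and should not be read as the evaluation script behind the reported numbers.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(reference, candidate):
    """Sentence-level BLEU-4 on a 0-100 scale (illustrative sketch).

    Assumptions: whitespace tokenisation, a single reference, and
    add-one smoothing on every n-gram order.
    """
    ref, cand = reference.split(), candidate.split()
    if not cand:
        return 0.0
    log_precision = 0.0
    for n in range(1, 5):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clipped matches: a candidate n-gram counts at most as often
        # as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        # Add-one smoothing keeps the logarithm finite when matches == 0.
        log_precision += math.log((matches + 1) / (total + 1)) / 4
    # Brevity penalty: penalise candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100 * bp * math.exp(log_precision)


if __name__ == "__main__":
    reference = "read a line from the file"
    print(bleu4(reference, "read a line from the file"))
    print(bleu4(reference, "read a line from socket"))
    print(bleu4(reference, "close the network connection now"))
```

Under these assumptions, a generated summary identical to the reference documentation scores 100, and the score falls as fewer 1- to 4-gram sequences are shared. Corpus-level scores such as those quoted above are normally aggregated over a whole test set rather than computed for a single pair.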