Mission-critical embedded software underpins our society's infrastructure, but it can become subject to new security vulnerabilities as technology advances. When security issues arise, Reverse Engineers (REs) use Software Reverse Engineering (SRE) tools to analyze the vulnerable binaries. However, existing tools offer limited support, and REs face a time-consuming, costly, and error-prone process that demands experience and expertise to understand the behavior of the software and its vulnerabilities. To improve these tools, we propose $\textit{cfg2vec}$, a hierarchical Graph Neural Network (GNN) based approach. To represent a binary, we propose a novel Graph-of-Graph (GoG) representation that combines the information in control-flow graphs and function-call graphs. $\textit{cfg2vec}$ learns to represent each binary function compiled for various CPU architectures, using a hierarchical GNN and a siamese-network-based supervised learning architecture. We evaluate $\textit{cfg2vec}$'s ability to predict function names from stripped binaries. Our results show that $\textit{cfg2vec}$ outperforms the state-of-the-art by $24.54\%$ in predicting function names, and by as much as $51.84\%$ when given more training data. Additionally, $\textit{cfg2vec}$ consistently outperforms the state-of-the-art across all CPU architectures, whereas the baseline requires multiple training runs to achieve similar performance. More importantly, our results demonstrate that $\textit{cfg2vec}$ can handle binaries built for unseen CPU architectures, indicating that our approach generalizes the knowledge it has learned. Lastly, we demonstrate its practicality by implementing it as a Ghidra plugin, which was used while solving DARPA Assured MicroPatching (AMP) challenges.
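To make the Graph-of-Graph idea concrete, the following is a minimal sketch of one way such a structure could be held in memory: an outer function-call graph whose nodes each carry an inner control-flow graph. It is an illustration only, not the $\textit{cfg2vec}$ implementation. The class names (`ControlFlowGraph`, `GraphOfGraphs`), the method names, and the toy function and block labels are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class ControlFlowGraph:
    """Inner graph (hypothetical): basic blocks of one function and jumps between them."""
    blocks: List[str]  # illustrative basic-block labels
    edges: List[Tuple[int, int]] = field(default_factory=list)  # (src, dst) block indices


@dataclass
class GraphOfGraphs:
    """Outer graph (hypothetical): one node per function, one directed edge per call."""
    functions: Dict[str, ControlFlowGraph] = field(default_factory=dict)
    calls: Set[Tuple[str, str]] = field(default_factory=set)

    def add_function(self, name: str, cfg: ControlFlowGraph) -> None:
        self.functions[name] = cfg

    def add_call(self, caller: str, callee: str) -> None:
        # Both endpoints must already exist as nodes of the outer graph.
        for name in (caller, callee):
            if name not in self.functions:
                raise KeyError(f"unknown function: {name}")
        self.calls.add((caller, callee))

    def callees(self, name: str) -> List[str]:
        """Return the functions directly called by `name`, sorted by name."""
        return sorted(dst for (src, dst) in self.calls if src == name)


# Toy binary with two functions; all labels are made up for illustration.
gog = GraphOfGraphs()
gog.add_function("main", ControlFlowGraph(
    blocks=["entry", "loop", "exit"],
    edges=[(0, 1), (1, 1), (1, 2)],
))
gog.add_function("helper", ControlFlowGraph(blocks=["entry"]))
gog.add_call("main", "helper")
```

In a layout like this, a hierarchical model could first summarize each inner graph into a per-function vector and then pass those vectors along the call edges, which is the general shape the abstract describes; how $\textit{cfg2vec}$ actually does this is not specified here.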