Binary code analysis is essential when source code is unavailable and has extensive applications across security domains. However, accurately resolving indirect call targets remains a longstanding challenge to the integrity of static analysis of binary code. The difficulty arises because the target operand of an indirect call instruction (e.g., call rax) is unknown until runtime, which leaves the inter-procedural control flow graph (CFG) incomplete. Previous approaches have struggled with low accuracy and limited scalability. To address these limitations, recent work has increasingly turned to machine learning (ML). However, ML-driven approaches face two significant obstacles: low-quality callsite-callee training pairs and inadequate binary code representation, both of which undermine model accuracy. In this paper, we introduce CupidCall, a novel approach for resolving indirect calls using graph neural networks. Existing ML models in this area often overlook key elements such as data and code cross-references, which are essential for understanding a program's control flow. In contrast, CupidCall augments CFGs with cross-references, preserving rich semantic information. Additionally, we leverage advanced compiler-level type analysis to generate high-quality callsite-callee training pairs, enhancing model precision and reliability. We further design a graph neural model that combines augmented CFGs with relational graph convolutions for accurate target prediction. Evaluated on real-world x86_64 binaries from GitHub and the Arch User Repository, CupidCall achieves an F1 score of 95.2%, outperforming state-of-the-art ML-based approaches. These results highlight CupidCall's effectiveness in building precise inter-procedural CFGs and its potential to advance downstream binary analysis and security applications.
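To make the idea of relational graph convolution over an augmented CFG concrete, the following is a minimal sketch, not CupidCall's actual model. It assumes three hypothetical edge types (control flow, code cross-reference, data cross-reference), toy two-dimensional node features, and hand-picked weight matrices. All names (`EDGE_TYPES`, `rgcn_layer`, the node labels) are illustrative assumptions. Each node sums its own transformed features with the mean of its incoming neighbours' features, transformed by a weight matrix specific to the edge type.

```python
from collections import defaultdict

# Hypothetical edge types for an augmented CFG; the paper's schema may differ.
EDGE_TYPES = ("control_flow", "code_xref", "data_xref")


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rgcn_layer(features, edges, rel_weights, self_weight):
    """One relational graph convolution step.

    features:    {node: feature vector}
    edges:       list of (source, target, edge_type); messages flow source -> target
    rel_weights: {edge_type: weight matrix}
    self_weight: weight matrix applied to each node's own features
    """
    # Group the incoming neighbours of every node by edge type.
    incoming = defaultdict(lambda: defaultdict(list))
    for src, dst, etype in edges:
        incoming[dst][etype].append(src)

    updated = {}
    for node, feat in features.items():
        total = matvec(self_weight, feat)  # self-loop term
        for etype, neighbours in incoming[node].items():
            # Mean-normalise messages within each relation.
            for src in neighbours:
                message = matvec(rel_weights[etype], features[src])
                total = [t + m / len(neighbours) for t, m in zip(total, message)]
        updated[node] = [max(0.0, t) for t in total]  # ReLU
    return updated


# Toy graph: an indirect callsite reached from one basic block and
# linked by a data cross-reference to a (hypothetical) function table.
identity = [[1.0, 0.0], [0.0, 1.0]]
features = {
    "callsite": [1.0, 0.0],
    "block_a": [0.0, 1.0],
    "func_table": [2.0, 2.0],
}
edges = [
    ("block_a", "callsite", "control_flow"),
    ("func_table", "callsite", "data_xref"),
]
rel_weights = {
    "control_flow": identity,
    "code_xref": identity,
    "data_xref": [[0.5, 0.0], [0.0, 0.5]],
}
out = rgcn_layer(features, edges, rel_weights, identity)
print(out["callsite"])
```

Because each relation has its own weight matrix, information arriving over a data cross-reference is transformed differently from information arriving over an ordinary control-flow edge. A plain CFG without cross-reference edges cannot express this distinction.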