Reverse engineers benefit from the presence of identifiers such as function names in a binary, but usually these are removed for release. Training a machine learning model to predict function names automatically is promising but fundamentally hard: unlike words in natural language, most function names occur only once. In this paper, we address this problem by introducing eXtreme Function Labeling (XFL), an extreme multi-label learning approach to selecting appropriate labels for binary functions. XFL splits function names into tokens, treating each as an informative label akin to the problem of tagging texts in natural language. We relate the semantics of binary code to labels through DEXTER, a novel function embedding that combines static analysis-based features with local context from the call graph and global context from the entire binary. We demonstrate that XFL/DEXTER outperforms the state of the art in function labeling on a dataset of 10,047 binaries from the Debian project, achieving a precision of 83.5%. We also study combinations of XFL with alternative binary embeddings from the literature and show that DEXTER consistently performs best for this task. As a result, we demonstrate that binary function labeling can be effectively phrased in terms of multi-label learning, and that binary function embeddings benefit from including explicit semantic features.
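To illustrate why splitting names into tokens helps, the following is a minimal sketch of turning function names into label sets. The abstract does not specify XFL's actual tokenization rules, so the splitter here (underscores plus camelCase boundaries) and the helper names `name_to_labels` and `label_counts` are assumptions made purely for illustration.

```python
import re
from collections import Counter
from typing import Iterable, Set

# Hypothetical splitter, not taken from the paper: handles acronym runs
# (e.g. "HTTP" in "parseHTTPHeader"), capitalized or lowercase words, and digits.
_TOKEN = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def name_to_labels(name: str) -> Set[str]:
    """Split a function name into a set of lowercase tokens (labels)."""
    labels: Set[str] = set()
    # First split on any non-alphanumeric separator such as "_" or ".".
    for part in re.split(r"[^A-Za-z0-9]+", name):
        labels.update(tok.lower() for tok in _TOKEN.findall(part))
    return labels


def label_counts(names: Iterable[str]) -> Counter:
    """Count how many function names each label appears in."""
    counts: Counter = Counter()
    for name in names:
        counts.update(name_to_labels(name))
    return counts


if __name__ == "__main__":
    names = ["read_file", "write_file", "parseHTTPHeader"]
    for n in names:
        print(n, sorted(name_to_labels(n)))
    # Each full name is unique, but the token "file" is shared by two names.
    print(label_counts(names).most_common(1))
```

In this toy example every full name occurs once, while the token `file` occurs in two names, which is the property that makes a multi-label formulation over tokens more learnable than predicting whole names.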