Automatic vulnerability detection on C/C++ source code has benefited from the introduction of machine learning, with many recent publications considering this combination. In contrast, assembly language and machine code artifacts receive little attention, although there are compelling reasons to study them: they are more representative of what is executed, they are more easily incorporated into dynamic analysis, and for closed-source code there is no alternative. We propose ROMEO, a publicly available, reproducible, and reusable binary vulnerability detection benchmark dataset derived from the Juliet test suite. Alongside it, we introduce a simple text-based assembly language representation that includes context for function-spanning vulnerability detection and semantics for detecting high-level vulnerabilities. Finally, we show that this representation, combined with an off-the-shelf classifier, compares favorably to state-of-the-art methods, including those operating on the full C/C++ code.