Writing software exploits is an important practice that offensive security analysts use to investigate and prevent attacks. Writing shellcodes in particular is time-consuming and technically challenging, as they are written in assembly language. In this work, we address the task of automatically generating shellcodes purely from natural-language descriptions by proposing an approach based on Neural Machine Translation (NMT). We then present an empirical study using a novel dataset (Shellcode_IA32), which consists of 3,200 assembly code snippets of real Linux/x86 shellcodes from public databases, annotated with natural-language descriptions. Moreover, we propose novel metrics to evaluate the accuracy of NMT at generating shellcodes. The empirical analysis shows that NMT can generate assembly code snippets from natural language with high accuracy and that, in many cases, it can generate entire shellcodes with no errors.