AI-based code generators are an emerging solution for automatically writing programs from natural language descriptions, using deep neural networks (Neural Machine Translation, NMT). In particular, code generators have been used for ethical hacking and offensive security testing by generating proof-of-concept attacks. Unfortunately, the evaluation of code generators still faces several issues. The current practice relies on output similarity metrics, i.e., automatic metrics that compute the textual similarity of the generated code with ground-truth references. However, it is not clear which metric to use and which metric is most suitable for specific contexts. This work analyzes a large set of output similarity metrics on offensive code generators. We apply the metrics to two state-of-the-art NMT models using two datasets containing offensive assembly and Python code with their descriptions in the English language. We compare the estimates from the automatic metrics with human evaluation and provide practical insights into their strengths and limitations.
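To make the notion of an output similarity metric concrete, the sketch below scores a generated snippet against a ground-truth reference on a 0-1 scale. It uses a simple character-level edit-distance ratio from the Python standard library as an illustration only; it is not necessarily one of the metrics evaluated in this work, and the example snippets are hypothetical.

```python
# Illustrative output similarity metric: compares generated code against a
# ground-truth reference and returns a score in [0, 1].
from difflib import SequenceMatcher


def similarity(generated: str, reference: str) -> float:
    """Character-level similarity ratio between generated and reference code."""
    return SequenceMatcher(None, generated, reference).ratio()


# Hypothetical example: code produced from the description
# "read the contents of /etc/passwd" versus a ground-truth snippet.
generated = "data = open('/etc/passwd').read()"
reference = "with open('/etc/passwd') as f:\n    data = f.read()"
print(f"similarity = {similarity(generated, reference):.2f}")
```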