AI-based code generators are an emerging solution for automatically writing programs from natural language descriptions by using deep neural networks (Neural Machine Translation, NMT). In particular, code generators have been used for ethical hacking and offensive security testing by generating proof-of-concept attacks. Unfortunately, the evaluation of code generators still faces several issues. The current practice relies on output similarity metrics, i.e., automatic metrics that compute the textual similarity of the generated code against ground-truth references. However, it is not clear which metric to use and which is most suitable for specific contexts. This work analyzes a large set of output similarity metrics on offensive code generators. We apply the metrics to two state-of-the-art NMT models, using two datasets containing offensive assembly and Python code with their descriptions in English. We compare the estimates from the automatic metrics with human evaluation and provide practical insights into their strengths and limitations.
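To make the notion of an output similarity metric concrete, the following is a minimal, purely illustrative sketch (not taken from the paper) that scores a model-generated code snippet against a ground-truth reference using sentence-level BLEU from NLTK; the code strings and the whitespace tokenization are assumptions for illustration only.

```python
# Illustrative sketch of an output similarity metric: sentence-level BLEU
# between a generated snippet and its ground-truth reference (hypothetical data).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "mov eax , 1".split()   # ground-truth code, naively whitespace-tokenized
candidate = "mov eax , 0".split()   # code produced by the NMT model

score = sentence_bleu(
    [reference],                                     # BLEU accepts a list of references
    candidate,
    smoothing_function=SmoothingFunction().method1,  # smoothing avoids zero scores on short snippets
)
print(f"BLEU: {score:.3f}")
```

Metrics of this kind only measure textual overlap with the reference, which is exactly why the paper compares their estimates against human evaluation.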