AI-based code generators are an emerging solution for automatically writing programs from natural language descriptions using deep neural networks (Neural Machine Translation, NMT). In particular, code generators have been used for ethical hacking and offensive security testing by generating proof-of-concept attacks. Unfortunately, the evaluation of code generators still faces several issues. The current practice relies on automatic metrics, which compute the textual similarity of the generated code with ground-truth references. However, it is not clear which metric to use and which is most suitable for a specific context. This practical experience report analyzes a large set of output similarity metrics on offensive code generators. We apply the metrics to two state-of-the-art NMT models, using two datasets containing offensive assembly and Python code paired with descriptions in English. We compare the estimates from the automatic metrics with human evaluation and provide practical insights into their strengths and limitations.
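As a concrete illustration of what such output similarity metrics compute, the following minimal sketch compares a generated code snippet against a ground-truth reference using a character-level similarity ratio and an exact-match check; the snippets and the difflib-based metric are illustrative assumptions, not the specific metrics or data evaluated in this report.

```python
# Minimal sketch of output similarity scoring between generated code and a
# ground-truth reference. Metric choice and example snippets are illustrative.
import difflib


def similarity_ratio(generated: str, reference: str) -> float:
    """Character-level similarity in [0, 1] based on difflib's SequenceMatcher."""
    return difflib.SequenceMatcher(None, generated, reference).ratio()


def exact_match(generated: str, reference: str) -> bool:
    """Strictest criterion: the generated code matches the reference verbatim."""
    return generated.strip() == reference.strip()


# Hypothetical example: a Python ground-truth snippet and a model output.
reference = "payload = b'A' * 64 + struct.pack('<I', ret_addr)"
generated = "payload = b'A' * 64 + struct.pack('<I', return_addr)"

print(similarity_ratio(generated, reference))  # high similarity despite the mismatch
print(exact_match(generated, reference))       # False: the variable name differs
```

A metric like this can assign a high score to code that is textually close to the reference yet not functionally equivalent, which is one reason automatic estimates need to be compared against human evaluation.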