ReCode: Robustness Evaluation of Code Generation Models

Shiqi Wang, Zheng Li, Haifeng Qian, Chenghao Yang, Zijian Wang, Mingyue Shang, Varun Kumar, Samson Tan, Baishakhi Ray, Parminder Bhatia, Ramesh Nallapati, Murali Krishna Ramanathan, Dan Roth, Bing Xiang

From arXiv. Code and data available at https://github.com/amazon-science/recode

Code generation models have achieved impressive performance. However, they tend to be brittle as slight edits to a prompt could lead to very different generations; these robustness properties, critical for user experience when deployed in real-life applications, are not well understood. Most existing works on robustness in text or code tasks have focused on classification, while robustness in generation tasks is an uncharted area and to date there is no comprehensive benchmark for robustness in code generation. In this paper, we propose ReCode, a comprehensive robustness evaluation benchmark for code generation models. We customize over 30 transformations specifically for code on docstrings, function and variable names, code syntax, and code format. They are carefully designed to be natural in real-life coding practice, preserve the original semantic meaning, and thus provide multifaceted assessments of a model's robustness performance. With human annotators, we verified that over 90% of the perturbed prompts do not alter the semantic meaning of the original prompt. In addition, we define robustness metrics for code generation models considering the worst-case behavior under each type of perturbation, taking advantage of the fact that executing the generated code can serve as objective evaluation. We demonstrate ReCode on SOTA models using HumanEval, MBPP, as well as function completion tasks derived from them. Interesting observations include: better robustness for CodeGen over InCoder and GPT-J; models are most sensitive to syntax perturbations; more challenging robustness evaluation on MBPP over HumanEval.
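To make the two central ideas of the abstract concrete, here is a minimal Python sketch: a semantics-preserving prompt perturbation and a worst-case aggregation over perturbed variants. It is an illustration under stated assumptions, not the benchmark's implementation. The helper names `camel_case_function_name` and `worst_case_pass_rate`, the sample prompt, and the pass/fail data are all hypothetical, and the abstract does not give the exact metric definitions, so the aggregation below is only one plausible reading of "worst-case behavior under each type of perturbation".

```python
import re
from typing import Dict, List


def camel_case_function_name(prompt: str, name: str) -> str:
    """Rename a snake_case function to camelCase everywhere in the prompt.

    Hypothetical stand-in for one semantics-preserving perturbation of
    function names; the real benchmark's transformations may differ.
    """
    head, *rest = name.split("_")
    new_name = head + "".join(part.capitalize() for part in rest)
    # \b avoids touching longer identifiers that merely contain `name`.
    return re.sub(rf"\b{re.escape(name)}\b", new_name, prompt)


def worst_case_pass_rate(results: Dict[str, List[bool]]) -> float:
    """Fraction of problems solved under *every* perturbed variant.

    `results` maps a problem id to the pass/fail outcome of executing the
    model's completion for each perturbed prompt of that problem.
    A single failure makes the whole problem count as non-robust.
    """
    if not results:
        return 0.0
    robust = sum(all(outcomes) for outcomes in results.values())
    return robust / len(results)


# Hypothetical prompt in the style of a function completion task.
prompt = (
    "def has_close_elements(numbers, threshold):\n"
    '    """Return True if any two numbers are closer than threshold."""\n'
)
perturbed = camel_case_function_name(prompt, "has_close_elements")

# Made-up execution outcomes for four problems, three variants each.
results = {
    "task_0": [True, True, True],
    "task_1": [True, False, True],   # breaks under one variant
    "task_2": [False, False, False],
    "task_3": [True, True, True],
}
robust_rate = worst_case_pass_rate(results)
```

Because correctness is decided by executing the generated code against tests, each entry in `results` is an objective pass/fail signal, which is what makes a strict "passes under all variants" criterion straightforward to compute.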
