In recent years, researchers have created and introduced a significant number of various code generation models. As human evaluation of every new model version is unfeasible, the community has adopted automatic evaluation metrics such as BLEU to approximate the results of human judgement. These metrics originate from the machine translation domain, and it is unclear whether they are applicable to code generation tasks and how well they agree with human evaluation on this task. There are also two metrics, CodeBLEU and RUBY, that were developed to estimate the similarity of code and take code properties into account. However, there are hardly any studies of their agreement with human evaluation. Despite all that, minimal differences in metric scores are used to claim the superiority of some code generation models over others. In this paper, we present a study on the applicability of six metrics -- BLEU, ROUGE-L, METEOR, ChrF, CodeBLEU, and RUBY -- for the evaluation of code generation models. We conduct the study on two different code generation datasets and use human annotators to assess the quality of all models run on these datasets. The results indicate that for the CoNaLa dataset of Python one-liners, none of the metrics can correctly emulate human judgement on which model is better with $>95\%$ certainty if the difference in model scores is less than 5 points. For the HearthStone dataset, which consists of classes of a particular structure, a difference in model scores of at least 2 points is enough to claim the superiority of one model over the other. Using our findings, we derive several recommendations on using metrics to estimate model performance on the code generation task.