A proper code evaluation metric (CEM) profoundly impacts the evolution of code generation, which is an important research field in NLP and software engineering. Prevailing CEMs can be categorized into match-based CEMs (e.g., BLEU, Accuracy, and CodeBLEU) and execution-based CEMs (e.g., AvgPassRatio and Pass@k), but both suffer from limitations. The former only measure differences in surface form regardless of the functional equivalence of code, while the latter incur huge execution overhead, including collecting expensive test cases, resolving tedious execution dependencies, and enormous execution time. To address these issues, in this paper, we propose CodeScore, an efficient and effective CEM for code generation, which estimates the test-case PassRatio of generated code without executing it. We also present a framework named UniCE for training unified code evaluation models by learning code execution, i.e., learning the PassRatio and Executability of generated code. To learn code execution comprehensively, we construct more than 100 test cases for each task in several popular benchmark datasets, covering MBPP, APPS, and HumanEval. Experimental results show that CodeScore achieves state-of-the-art correlation with execution-based CEMs: CodeScore is strongly correlated with AvgPassRatio, and binary CodeScore is moderately correlated with Pass@1. In particular, CodeScore eliminates the need for test cases and execution dependencies during inference, and reduces execution time by three orders of magnitude compared to AvgPassRatio and Pass@1.
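For reference, the execution-based CEMs that CodeScore is trained to approximate can be computed as follows. This is a minimal illustrative sketch, not the paper's implementation: the test-case format (argument tuple paired with an expected output) and the function names `pass_ratio` and `pass_at_k` are assumptions; `pass_at_k` follows the standard unbiased Pass@k estimator.

```python
import math
from typing import Callable, List, Tuple


def pass_ratio(candidate: Callable, test_cases: List[Tuple[tuple, object]]) -> float:
    """Fraction of test cases the candidate solution passes (PassRatio)."""
    passed = 0
    for args, expected in test_cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass  # runtime errors count as failed test cases
    return passed / len(test_cases)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: probability that at least one of k samples,
    drawn from n generations of which c are correct, passes all test cases."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# Hypothetical usage on a toy task ("add two numbers"):
generated = lambda a, b: a + b
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
print(pass_ratio(generated, tests))   # 1.0
print(pass_at_k(n=10, c=3, k=1))      # 0.3
```

AvgPassRatio averages `pass_ratio` over all tasks in a benchmark, and both metrics require actually executing the generated code against the test suite; CodeScore instead predicts the PassRatio directly from the code text.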