Since the rise of neural models of code that can generate long expressions and statements rather than a single next token, a major challenge has been reliably evaluating their generated output. In this paper, we propose CodeBERTScore: an automatic evaluation metric for code generation, which builds on BERTScore (Zhang et al., 2020). Instead of measuring exact token matching, as BLEU does, CodeBERTScore computes a soft similarity score between each token in the generated code and each token in the reference code, using the contextual encodings of large pretrained models. Further, instead of encoding only the generated tokens, as in BERTScore, CodeBERTScore also encodes the programmatic context surrounding the generated code. We perform an extensive evaluation of CodeBERTScore across four programming languages and find that it achieves a higher correlation with human preference and with functional correctness than all existing metrics. That is, generated code that receives a higher score from CodeBERTScore is more likely to be preferred by humans and to function correctly when executed. Finally, while CodeBERTScore can be used with a multilingual CodeBERT as its base model, we release five language-specific pretrained models to use with our publicly available code at https://github.com/neulab/code-bert-score . Our language-specific models have been downloaded more than 25,000 times from the Hugging Face Hub.
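The soft-matching idea described above can be sketched as follows. This is a minimal, illustrative implementation of BERTScore-style soft token matching over precomputed embeddings, not the actual CodeBERTScore code: in the real metric the embeddings come from a pretrained code model (and the programmatic context is encoded but excluded from scoring), whereas here they are plain NumPy vectors supplied by the caller.

```python
import numpy as np

def soft_match_scores(cand_emb: np.ndarray, ref_emb: np.ndarray):
    """BERTScore-style soft matching (illustrative sketch).

    cand_emb: (m, d) contextual embeddings of generated-code tokens
    ref_emb:  (n, d) contextual embeddings of reference-code tokens
    Returns (precision, recall, f1) computed from pairwise cosine
    similarity, with greedy best-match pooling in each direction.
    """
    # Normalize rows so that dot products equal cosine similarities.
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = c @ r.T  # (m, n) pairwise cosine-similarity matrix

    # Precision: each generated token matched to its most similar
    # reference token; recall: the symmetric direction.
    precision = sim.max(axis=1).mean()
    recall = sim.max(axis=0).mean()
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Because matching is soft, a generated token that is semantically close (but not identical) to a reference token still contributes partial credit, which is what lets the metric reward paraphrased-but-correct code that exact-match metrics like BLEU penalize.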