The ability to solve problems is a hallmark of intelligence and has been an enduring goal in AI. AI systems that can create programs as solutions to problems, or assist developers in writing programs, can increase productivity and make programming more accessible. Recently, pre-trained large language models have shown impressive abilities in generating new code from natural language descriptions, repairing buggy code, translating code between languages, and retrieving relevant code segments. However, the evaluation of these models has often been performed in a scattered way: on only one or two specific tasks, in a few languages, at a partial granularity (e.g., the function level), and in many cases without proper training data. Even more concerning, in most cases generated code has been evaluated in terms of mere lexical overlap with a reference rather than actual execution, whereas the semantic similarity (or equivalence) of two code segments depends only on their ``execution similarity'', i.e., whether they produce the same output for a given input.
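The ``execution similarity'' criterion can be illustrated with a toy sketch (the function names and test inputs below are our own illustrative choices, not drawn from any benchmark): two implementations that share almost no tokens, and would score poorly under lexical-overlap metrics, are nonetheless judged equivalent by executing both on the same inputs.

```python
# Two lexically dissimilar implementations of the same function:
# summing the integers 1..n.

def sum_iterative(n):
    # Accumulate the sum with an explicit loop.
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

def sum_closed_form(n):
    # Use Gauss's closed-form formula instead of a loop.
    return n * (n + 1) // 2

# Execution-based comparison: run both candidates on shared test
# inputs and check that their outputs agree on every case.
test_inputs = [0, 1, 10, 1000]
equivalent = all(sum_iterative(n) == sum_closed_form(n) for n in test_inputs)
print(equivalent)  # True: same output for every tested input
```

Under a lexical metric such as token overlap, these two snippets look almost unrelated; under execution-based evaluation they are interchangeable, which is the notion of correctness the abstract argues for.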