While programming is one of the most broadly applicable skills in modern society, modern machine learning models still cannot code solutions to basic problems. Despite its importance, there has been surprisingly little work on evaluating code generation, and it is difficult to assess code generation performance rigorously. To meet this challenge, we introduce APPS, a benchmark for code generation. Unlike prior work in more restricted settings, our benchmark measures the ability of models to take an arbitrary natural language specification and generate satisfactory Python code. Similar to how companies assess candidate software developers, we then evaluate models by checking their generated code on test cases. Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges. We fine-tune large language models on both GitHub and our training set, and we find that the prevalence of syntax errors is decreasing exponentially as models improve. Recent models such as GPT-Neo can pass approximately 20% of the test cases of introductory problems, so we find that machine learning models are now beginning to learn how to code. As the social significance of automatic code generation increases over the coming years, our benchmark can provide an important measure for tracking advancements.
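To make the evaluation protocol concrete, the sketch below shows one way generated code might be checked against input/output test cases, in the spirit of the abstract's description. This is an illustrative assumption rather than the benchmark's actual harness: the function names (`run_solution`, `test_case_accuracy`), the timeout, and the toy problem are all hypothetical.

```python
import subprocess
import sys

def run_solution(solution_code: str, stdin_input: str, timeout: float = 4.0) -> str:
    """Run a candidate solution in a separate process and capture its stdout.

    A subprocess is used so that an infinite loop or crash in the
    generated code cannot take down the evaluation harness.
    """
    result = subprocess.run(
        [sys.executable, "-c", solution_code],
        input=stdin_input,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout

def test_case_accuracy(solution_code: str, test_cases: list) -> float:
    """Fraction of test cases whose output matches the expected output."""
    passed = 0
    for case in test_cases:
        try:
            output = run_solution(solution_code, case["input"])
        except Exception:
            continue  # timeouts and runtime errors count as failed cases
        if output.strip() == case["output"].strip():
            passed += 1
    return passed / len(test_cases)

# Hypothetical problem: read two integers and print their sum.
generated_code = "a, b = map(int, input().split())\nprint(a + b)"
cases = [
    {"input": "1 2\n", "output": "3\n"},
    {"input": "10 -4\n", "output": "6\n"},
]
print(test_case_accuracy(generated_code, cases))  # 1.0 if both cases pass
```

A metric like the fraction of test cases passed (as in the "approximately 20% of the test cases" figure above) can be obtained by averaging such per-problem scores over the benchmark.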