Automated Program Repair (APR) has recently shifted toward data-driven techniques, in particular deep neural networks. These approaches typically train on hundreds of thousands or even millions of non-executable code fragments. We want to draw attention to an aspect of code that is often neglected in Neural Program Repair (NPR): its execution. Executing code has several significant advantages: it allows test-based evaluation of candidate fixes, and it can provide valuable information to aid repair. In this work, we present a fully executable dataset of 450,000 small buggy/fixed program pairs, written in eight different programming languages and originally submitted to programming competition websites. Along with the dataset, we provide infrastructure to compile, safely execute, and test programs, as well as fine-grained bug-type labels. As a point of reference, we report basic evaluation results for two baselines, one based on a generate-and-validate approach and one on deep learning. With this dataset we pursue several goals: to lift NPR beyond fully static code representations, to foster the use of execution-based features, and, by including several languages, to counterbalance the predominance of Java in the current landscape of APR datasets and benchmarks.
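To make the idea of test-based evaluation concrete, the following is a minimal sketch of how a candidate fix could be checked against input/output tests. It is not the infrastructure provided with the dataset: the function name `passes_tests`, the toy programs, and the test cases are all hypothetical, and a plain timeout stands in for real sandboxing.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(source, tests, timeout_s=10.0):
    """Return True if the program prints the expected output for every test.

    `source` is Python program text; `tests` is a list of (stdin, expected_stdout)
    pairs. Hypothetical helper: a timeout alone is NOT a safe sandbox.
    """
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "candidate.py"
        path.write_text(source)
        for stdin_text, expected in tests:
            try:
                result = subprocess.run(
                    [sys.executable, str(path)],
                    input=stdin_text,
                    capture_output=True,
                    text=True,
                    timeout=timeout_s,
                )
            except subprocess.TimeoutExpired:
                return False  # treat a hang as a failed test
            # A crash or any output mismatch rejects the candidate.
            if result.returncode != 0 or result.stdout.strip() != expected.strip():
                return False
    return True


# Toy buggy/fixed pair in the style of a competition task: print a + b.
BUGGY = "a, b = map(int, input().split())\nprint(a - b)\n"
FIXED = "a, b = map(int, input().split())\nprint(a + b)\n"
TESTS = [("1 2\n", "3"), ("10 -4\n", "6")]

if __name__ == "__main__":
    print("buggy passes:", passes_tests(BUGGY, TESTS))
    print("fixed passes:", passes_tests(FIXED, TESTS))
```

In a generate-and-validate setting, a check of this kind would be applied to each candidate patch in turn, and the first candidate that passes all tests would be reported as a plausible fix.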