Source code repositories consist of large codebases that often contain error-prone programs. The increasing complexity of software has led to a drastic rise in the time and cost of identifying and fixing these defects. Various methods exist to automatically generate fixes for buggy code. However, due to the large combinatorial space of possible solutions for a particular bug, few tools and datasets are available to evaluate generated code effectively. In this work, we introduce FixEval, a benchmark comprising buggy code submissions to competitive programming problems and their respective fixes. We introduce a rich test suite to evaluate and assess the correctness of model-generated program fixes. We consider two Transformer language models pretrained on programming languages as our baselines and compare them using match-based and execution-based evaluation metrics. Our experiments show that match-based metrics do not reflect the quality of model-generated program fixes accurately, whereas execution-based methods evaluate programs against the full set of test cases and scenarios designed for each problem. We therefore believe FixEval provides a step towards real-world automatic bug fixing and model-generated code evaluation.