The complexity of modern software has led to a drastic increase in the time and cost of detecting and fixing software bugs. In response, researchers have explored various methods to automatically generate fixes for buggy code. However, due to the large combinatorial space of possible fixes for any given bug, few tools and datasets are available to evaluate model-generated fixes effectively. To address this issue, we introduce FixEval, a benchmark comprising buggy code submissions to competitive programming problems and their corresponding fixes. FixEval offers an extensive collection of unit tests to evaluate the correctness of model-generated program fixes, along with additional information on time and memory constraints and a judge verdict for each submission. We consider two Transformer language models pretrained on programming languages as our baselines and compare them using match-based and execution-based evaluation metrics. Our experiments show that match-based metrics do not accurately reflect the correctness of model-generated program fixes, whereas execution-based methods evaluate programs by running them against the full set of test cases designed for each problem. We therefore believe FixEval provides a step toward real-world automatic bug fixing and model-generated code evaluation. The dataset and models are open-sourced at https://github.com/mahimanzum/FixEval.
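To make the distinction between match-based and execution-based evaluation concrete, the following is a minimal sketch of how an execution-based judge might score a candidate fix: it runs the program on each unit test with a time limit and returns a verdict. The names `run_candidate`, `TestCase`, and `fixed_solution.py` are illustrative assumptions, not part of FixEval's released tooling.

```python
import subprocess
from dataclasses import dataclass


@dataclass
class TestCase:
    stdin: str      # input fed to the candidate program
    expected: str   # expected stdout


def run_candidate(command, tests, time_limit=2.0):
    """Return a verdict for a candidate fix: 'Accepted', 'Wrong Answer',
    'Time Limit Exceeded', or 'Runtime Error' (hypothetical judge logic)."""
    for case in tests:
        try:
            proc = subprocess.run(
                command,
                input=case.stdin,
                capture_output=True,
                text=True,
                timeout=time_limit,
            )
        except subprocess.TimeoutExpired:
            return "Time Limit Exceeded"
        if proc.returncode != 0:
            return "Runtime Error"
        if proc.stdout.strip() != case.expected.strip():
            return "Wrong Answer"
    return "Accepted"


# Example: judge a (hypothetical) fixed Python submission on two unit tests.
tests = [TestCase("1 2\n", "3"), TestCase("5 7\n", "12")]
print(run_candidate(["python3", "fixed_solution.py"], tests))
```

A match-based metric, by contrast, would only compare the generated source text against the reference fix, which can reject semantically correct fixes that differ syntactically.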