The increasing complexity of software has led to a drastic rise in the time and cost of identifying and fixing bugs. Various approaches have been explored in the literature to automatically generate fixes for buggy code. However, due to the large combinatorial space of possible fixes for a particular bug, few tools and datasets are available to evaluate model-generated fixes effectively. In this work, we introduce FIXEVAL, a benchmark comprising buggy code submissions to competitive programming problems and their corresponding fixes. FIXEVAL provides a rich test suite for assessing the correctness of model-generated program fixes, along with metadata on time and memory constraints and the judge's acceptance verdict. We consider two Transformer language models pretrained on programming languages as our baselines and compare them using match-based and execution-based evaluation metrics. Our experiments show that match-based metrics do not accurately reflect the quality of model-generated program fixes, whereas execution-based methods evaluate programs against all test cases and scenarios designed explicitly for the corresponding problem. We therefore believe FIXEVAL provides a step toward real-world automatic bug fixing and model-generated code evaluation. The dataset and models are open-sourced.\footnote{\url{https://github.com/mahimanzum/FixEval}}