Datasets of real-world bugs are important artifacts for researchers and programmers: they lay the empirical and experimental foundation for SE/PL research such as fault localization, software testing, and program repair. All known state-of-the-art datasets are constructed manually, which inevitably limits their scalability, their representativeness, and their support for emerging data-driven research. In this work, we propose an approach that automates the harvesting of replicable regression bugs from code evolution history. We focus on regression bugs because they (1) show how a bug is introduced and fixed (as ordinary bugs do), (2) support regression bug analysis, and (3) carry a much stronger specification (i.e., the original passing version) for general bug analysis. Technically, we address an information retrieval problem over code evolution history. Given a code repository, we search for regressions in which a test passes on a regression-fixing commit, fails on a regression-inducing commit, and passes on an earlier working commit. We address the challenges of (1) identifying potential regression-fixing commits in the code evolution history, (2) migrating the test and its code dependencies across the history, and (3) minimizing the compilation overhead during the regression search. We build a tool, RegMiner, which harvested 537 regressions from 66 projects in 3 weeks, creating, to the best of our knowledge, the largest replicable regression dataset within the shortest period. Moreover, an empirical study on our regression dataset shows a gap between popular regression fault localization techniques (e.g., delta-debugging) and the real fix, revealing new data-driven research opportunities.
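The search criterion stated above (pass on the fixing commit, fail on the inducing commit, pass on an earlier working commit) can be illustrated with a minimal sketch. This is not RegMiner's implementation: the function name `find_regression`, the `run_test` callback, and the pass/fail/unknown encoding are all hypothetical, and the plain backward scan ignores the test migration and compilation-overhead concerns the abstract mentions.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical encoding of one test run on one commit:
# True = pass, False = fail, None = could not be built or run.
Outcome = Optional[bool]


def find_regression(
    commits: Sequence[str],
    run_test: Callable[[str], Outcome],
    fixing_index: int,
) -> Optional[Tuple[str, str, str]]:
    """Return (working, inducing, fixing) commit ids, or None.

    `commits` is ordered oldest to newest, and `fixing_index` points at a
    candidate regression-fixing commit (assumed to be given).
    """
    # The test must pass on the candidate fixing commit.
    if run_test(commits[fixing_index]) is not True:
        return None

    inducing = None
    # Walk backward through history from the commit just before the fix.
    for i in range(fixing_index - 1, -1, -1):
        outcome = run_test(commits[i])
        if outcome is False:
            # Earliest failing commit seen so far.
            inducing = commits[i]
        elif outcome is True:
            if inducing is None:
                # No failure was observed before the fix: no bug witnessed.
                return None
            return commits[i], inducing, commits[fixing_index]
        # outcome is None: commit is skipped as inconclusive.
    # The test never passed earlier, so this is not a regression.
    return None


# Simulated history: the test passes on c0 and c1, fails on c2 and c3,
# and passes again on c4.
history = ["c0", "c1", "c2", "c3", "c4"]
outcomes = {"c0": True, "c1": True, "c2": False, "c3": False, "c4": True}
result = find_regression(history, outcomes.get, fixing_index=4)
print(result)
```

On the simulated history, the scan reports `c1` as the working commit, `c2` as the regression-inducing commit, and `c4` as the regression-fixing commit. A history in which the test never passed before the fix yields `None`, which separates regressions from ordinary bugs.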