Constructing Regression Dataset from Code Evolution History

Datasets of real-world bugs are important artifacts for researchers and programmers. They lay the empirical and experimental foundation for SE/PL research such as fault localization, software testing, and program repair. All known state-of-the-art datasets are constructed manually, which inevitably limits their scalability, their representativeness, and their support for emerging data-driven research. In this work, we propose an approach to automate the process of harvesting replicable regression bugs from the code evolution history. We focus on regression bug datasets because they (1) show how a bug is introduced and fixed (as ordinary bug datasets do), (2) support regression bug analysis, and (3) incorporate a much stronger specification (i.e., the original passing version) for general bug analysis. Technically, we address an information retrieval problem on code evolution history. Given a code repository, we search for regressions in which a test passes on a regression-fixing commit, fails on a regression-inducing commit, and passes on an earlier working commit. We address three challenges: (1) identifying potential regression-fixing commits in the code evolution history, (2) migrating the test and its code dependencies across the history, and (3) minimizing the compilation overhead during the regression search. We build our tool, RegMiner, which harvested 537 regressions from 66 projects in 3 weeks, creating what is, to the best of our knowledge, the largest replicable regression dataset built within the shortest period. Moreover, our empirical study on the regression dataset shows a gap between popular regression fault localization techniques (e.g., delta debugging) and the real fix, revealing new data-driven research opportunities.
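The pass/fail/pass condition described in the abstract can be illustrated with a small sketch. This is not RegMiner's algorithm: it is a naive backward linear scan over a hypothetical, already-linearized commit list, and it assumes the test has already been migrated so that it can run on every commit. The names `find_regression`, `history`, and `test_passes` are invented for this illustration.

```python
from typing import Callable, Optional, Sequence, Tuple


def find_regression(
    history: Sequence[str],
    fix_index: int,
    test_passes: Callable[[str], bool],
) -> Optional[Tuple[str, str, str]]:
    """Return (working, inducing, fixing) commits, or None if no regression exists.

    history      -- commit ids ordered oldest first (assumed linear)
    fix_index    -- position of the candidate regression-fixing commit
    test_passes  -- hypothetical oracle: does the test pass on this commit?
    """
    # The test must pass on the candidate fixing commit...
    if not test_passes(history[fix_index]):
        return None
    # ...and fail on the commit just before it, otherwise nothing was fixed.
    if fix_index == 0 or test_passes(history[fix_index - 1]):
        return None
    # Walk backward through the failing commits. Invariant: history[i] fails.
    for i in range(fix_index - 1, 0, -1):
        if test_passes(history[i - 1]):
            # history[i - 1] is the working commit, history[i] induced the failure.
            return history[i - 1], history[i], history[fix_index]
    # The test never passed before the fix: an ordinary bug, not a regression.
    return None


# Toy history: the test passes on c0 and c1, breaks at c2, and is fixed at c4.
outcomes = {"c0": True, "c1": True, "c2": False, "c3": False, "c4": True}
history = ["c0", "c1", "c2", "c3", "c4"]
result = find_regression(history, 4, outcomes.__getitem__)
```

A linear scan runs the test on every intermediate commit, which is exactly the kind of compilation and execution cost the abstract names as its third challenge; a real tool would need a cheaper search strategy and a way to handle commits that do not compile.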


