Many researchers assume that, for software analytics, "more data is better". We write to show that, at least for learning defect predictors, this may not be true. To demonstrate this, we analyzed hundreds of popular GitHub projects. These projects ran for 84 months and contained 3,728 commits (median values). Across these projects, most of the defects occur very early in their life cycle. Hence, defect predictors learned from the first 150 commits and four months of data perform just as well as anything else. This means that, at least for the projects studied here, after the first few months we need not continually update our defect prediction models. We hope these results inspire other researchers to adopt a "simplicity-first" approach to their work. Indeed, some domains require a complex and data-hungry analysis. But before assuming complexity, it is prudent to check the raw data, looking for "shortcuts" that simplify the whole analysis.
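To make the comparison concrete, here is a minimal sketch of the kind of experiment the abstract describes: train one defect predictor on only the first 150 commits, another on the full history, and score both on a later, held-out window of commits. This is not the authors' actual pipeline; the features, the synthetic data generator, and the choice of classifier are all illustrative assumptions standing in for process metrics mined from a real project.

```python
# Hedged sketch: compare an "early life cycle" defect predictor against one
# trained on the full commit history. All data below is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
n_commits = 3728                          # median project size reported above

# Assumption: defects skew toward early life, so the probability that a
# commit is buggy decays with commit index.
p_buggy = 0.4 * np.exp(-np.arange(n_commits) / 500) + 0.05
y = rng.random(n_commits) < p_buggy

# Four hypothetical commit-level features (e.g. churn, files touched, ...);
# the first is given a weak association with bugginess so there is a signal
# for the model to learn.
X = rng.normal(size=(n_commits, 4))
X[:, 0] += 1.5 * y

# Chronological split: the last 20% of commits is the test window.
# (A random shuffle would leak future information into training.)
split = int(0.8 * n_commits)
X_test, y_test = X[split:], y[split:]

for label, cutoff in [("first 150 commits", 150), ("full history", split)]:
    model = LogisticRegression().fit(X[:cutoff], y[:cutoff])
    r = recall_score(y_test, model.predict(X_test))
    print(f"{label:>18}: recall = {r:.2f}")
```

Under these assumptions the two models score similarly, which is the shape of the result the abstract reports: the early window already contains the signal that later data would provide.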