Before researchers rush to reason across all available data or try complex methods, perhaps it is prudent to first check for simpler alternatives. Specifically, if the historical data has the most information in some small region, perhaps a model learned from that region would suffice for the rest of the project. To support this claim, we offer a case study with 240 projects, where we find that the information in those projects "clumps" towards the earliest parts of the project. A quality prediction model learned from just the first 150 commits works as well as, or better than, state-of-the-art alternatives. Using just this "early bird" data, we can build models very quickly and very early in the project life cycle. Moreover, using this early bird method, we have shown that a simple model (with just a few features) generalizes to hundreds of projects. Based on this experience, we suspect that prior work on generalizing quality models may have needlessly complicated an inherently simple process. Further, prior work that focused on later-life cycle data needs to be revisited, since its conclusions were drawn from relatively uninformative regions. Replication note: all our data and scripts are available here: https://github.com/snaraya7/early-bird
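The "early bird" idea described above can be illustrated with a minimal sketch: order a project's commits chronologically, learn from only the first 150, and apply the result to everything that follows. The `Commit` fields, the single `lines_added` feature, and the threshold learner below are all hypothetical stand-ins chosen for brevity; they are not the features or the learner used in the study, whose actual scripts are in the replication package.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Commit:
    timestamp: int    # hypothetical: commit time as an integer
    lines_added: int  # hypothetical feature, for illustration only
    buggy: bool       # hypothetical label: was the commit defect-inducing?


def early_bird_split(
    commits: Sequence[Commit], n_early: int = 150
) -> Tuple[List[Commit], List[Commit]]:
    """Order commits by time; return (first n_early commits, the rest)."""
    ordered = sorted(commits, key=lambda c: c.timestamp)
    return ordered[:n_early], ordered[n_early:]


def fit_threshold(train: Sequence[Commit]) -> int:
    """Pick the lines_added cut-off with the best training accuracy.

    The resulting rule predicts "buggy" when lines_added >= threshold.
    This one-feature rule is an assumption of this sketch, not the
    paper's model.
    """
    if not train:
        raise ValueError("need at least one training commit")
    best_threshold, best_correct = 0, -1
    for t in sorted({c.lines_added for c in train}):
        correct = sum((c.lines_added >= t) == c.buggy for c in train)
        if correct > best_correct:
            best_threshold, best_correct = t, correct
    return best_threshold


def accuracy(threshold: int, commits: Sequence[Commit]) -> float:
    """Fraction of commits whose label matches the threshold rule."""
    hits = sum((c.lines_added >= threshold) == c.buggy for c in commits)
    return hits / len(commits)


if __name__ == "__main__":
    # Synthetic history: 400 commits, label fully determined by size.
    history = [
        Commit(timestamp=i, lines_added=(i * 37) % 100,
               buggy=(i * 37) % 100 >= 60)
        for i in range(400)
    ]
    early, later = early_bird_split(history)
    model = fit_threshold(early)
    print(f"trained on {len(early)} commits, "
          f"accuracy on the later {len(later)}: {accuracy(model, later):.2f}")
```

The synthetic data here is deliberately trivial, so the sketch says nothing about real-world performance; it only shows the shape of the protocol, namely that the later commits are never seen during training.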