Managers and practitioners become dubious about software analytics when its conclusions keep changing as we look at new projects. GENERAL is a new approach for quickly finding conclusions that generalize across hundreds of projects. This algorithm (a) removes spurious attributes via feature selection; (b) fixes training data imbalance via synthetic instances; (c) recursively clusters the project data; (d) finds the best model within each cluster, then promotes it up the cluster tree; and (e) returns the model promoted to the top. GENERAL is much faster than prior methods (4.8 hours versus 204 hours in our case studies) and theoretically scales better (O(N^2/m) versus O(N^2), which is a large reduction since we often find m>20 clusters). When tested on 756 Github projects, a single defect prediction model generalized over all those projects while also being useful and insightful; i.e. that model worked just as well as 756 separate models learned from each project, and it succinctly showed which key factors most contributed to defects. Hence, when exploring hundreds of projects, we endorse GENERAL reasoning.
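To make steps (c) to (e) concrete, the following is a minimal sketch of the cluster-then-promote idea, not the authors' implementation. Everything in it is an assumption made for illustration: the nearest-centroid learner, the median bisection used for clustering, the accuracy-based scoring, the synthetic data, and the names `train`, `promote`, `make_project` and `leaf_size`. Steps (a) and (b), feature selection and synthetic oversampling, are omitted.

```python
import random

# A project is a list of (features, label) rows; label 1 means "defective".

def centroid(vectors):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(col) for col in zip(*vectors)]

def train(project):
    """Toy learner (an assumption): one centroid per class label."""
    labels = {y for _, y in project}
    return {lab: centroid([x for x, y in project if y == lab]) for lab in labels}

def predict(model, x):
    """Return the label whose centroid is nearest to x."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda lab: dist(model[lab]))

def accuracy(model, project):
    return sum(predict(model, x) == y for x, y in project) / len(project)

def promote(ids, projects, models, leaf_size=4):
    """Return the id of the project whose model wins this subtree."""
    if len(ids) == 1:
        return ids[0]

    def score(cand):
        # Mean accuracy of the candidate's model on every other project here.
        others = [i for i in ids if i != cand]
        return sum(accuracy(models[cand], projects[i]) for i in others) / len(others)

    if len(ids) <= leaf_size:
        candidates = ids  # leaf cluster: every local model competes
    else:
        # Bisect at the median of the widest dimension of the project centroids.
        summary = {i: centroid([x for x, _ in projects[i]]) for i in ids}
        dims = range(len(summary[ids[0]]))
        widest = max(dims, key=lambda d: max(summary[i][d] for i in ids)
                                         - min(summary[i][d] for i in ids))
        ordered = sorted(ids, key=lambda i: summary[i][widest])
        half = len(ordered) // 2
        # Only the winners promoted from the two children compete at this level.
        candidates = [promote(ordered[:half], projects, models, leaf_size),
                      promote(ordered[half:], projects, models, leaf_size)]
    return max(candidates, key=score)

def make_project(rng, n_rows=40):
    """Synthetic data (an assumption): two features, classes 4 units apart."""
    shift = rng.uniform(-0.5, 0.5)  # small per-project offset
    rows = []
    for k in range(n_rows):
        y = k % 2
        centre = 4.0 * y + shift
        rows.append(([rng.gauss(centre, 1.0), rng.gauss(centre, 1.0)], y))
    return rows

rng = random.Random(1)
projects = {i: make_project(rng) for i in range(12)}
models = {i: train(p) for i, p in projects.items()}
best = promote(list(projects), projects, models)
overall = sum(accuracy(models[best], p) for p in projects.values()) / len(projects)
print(f"promoted model: project {best}, mean accuracy {overall:.2f}")
```

In this sketch, all-pairs comparison of models happens only inside leaf clusters, while each parent compares just the two promoted winners. Restricting the pairwise work to small clusters is presumably what lies behind the O(N^2/m) versus O(N^2) figure quoted above.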