Gradient Boosting Machines (GBM) are among the go-to algorithms for tabular data, producing state-of-the-art results in many prediction tasks. Despite its popularity, the GBM framework suffers from a fundamental flaw in its base learners: most implementations utilize decision trees that are typically biased towards categorical variables with large cardinalities. The effect of this bias has been extensively studied over the years, mostly in terms of predictive performance. In this work, we extend the scope and study the effect of biased base learners on GBM feature importance (FI) measures. We show that although these implementations demonstrate highly competitive predictive performance, they still, surprisingly, suffer from bias in FI. By utilizing cross-validated (CV) unbiased base learners, we fix this flaw at a relatively low computational cost. We demonstrate the suggested framework in a variety of synthetic and real-world setups, showing a significant improvement in all GBM FI measures while maintaining comparable predictive accuracy.
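As a minimal illustration of the cardinality bias the abstract refers to (a sketch using scikit-learn's impurity-based feature_importances_, not the paper's own experimental setup), the snippet below fits a GBM on synthetic data in which a pure-noise, label-encoded categorical column with high cardinality competes with a single genuinely informative feature; the noise column often receives a non-trivial share of the reported importance.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 2000

# One genuinely informative binary feature; the label equals it up to 10% noise.
informative = rng.integers(0, 2, size=n)
y = (informative ^ (rng.random(n) < 0.1)).astype(int)

# One uninformative categorical feature with high cardinality (label-encoded).
high_card_noise = rng.integers(0, 500, size=n)

X = np.column_stack([informative, high_card_noise])
gbm = GradientBoostingClassifier(n_estimators=100, max_depth=3,
                                 random_state=0).fit(X, y)

# Impurity-based FI often assigns noticeable weight to the pure-noise,
# high-cardinality column, illustrating the bias discussed in the abstract.
print(dict(zip(["informative", "high_card_noise"],
               gbm.feature_importances_.round(3))))
```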