Gradient boosted decision trees are some of the most popular algorithms in applied machine learning. They are a flexible and powerful tool that can robustly fit nearly any tabular dataset in a scalable and computationally efficient way. Among the most critical parameters to tune when fitting these models are the various penalty terms used to distinguish signal from noise in the current model. These penalties are effective in practice, but they lack robust theoretical justification. In this paper we develop and present a novel, theoretically justified hypothesis test of split quality for gradient boosted tree ensembles and demonstrate that using this test in place of the common penalty terms leads to a significant reduction in out-of-sample loss. Additionally, the method provides a theoretically well-founded stopping condition for the tree-growing algorithm. We also present several extensions of the method, opening the door to a wide variety of novel tree-pruning algorithms.
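To make the general idea concrete, the sketch below illustrates how a significance threshold can replace a fixed gain penalty as the accept/reject criterion for a candidate split. It is only an illustration under a simplifying assumption: the statistic used here is a plain Welch two-sample t-test on per-observation gradients, not the test derived in this paper, and the function name `split_is_significant` is hypothetical.

```python
import numpy as np
from scipy import stats


def split_is_significant(gradients, feature_values, threshold, alpha=0.05):
    """Illustrative split-quality test: accept a candidate split only if the
    mean gradient differs significantly between the two child nodes.

    Uses a Welch two-sample t-test as a stand-in statistic; the paper's
    actual test statistic is developed in the body of the text.
    """
    left = gradients[feature_values <= threshold]
    right = gradients[feature_values > threshold]
    if len(left) < 2 or len(right) < 2:
        return False  # not enough observations in a child node to test
    _, p_value = stats.ttest_ind(left, right, equal_var=False)
    return p_value < alpha


# Example: a feature with no real relationship to the gradients should
# usually fail the test, so the split (and further growth) is rejected.
rng = np.random.default_rng(0)
g = rng.normal(size=1_000)        # per-observation gradients
x = rng.uniform(size=1_000)       # candidate split feature
print(split_is_significant(g, x, threshold=0.5))  # typically False
```

When no candidate split at a node passes the test, the node is left as a leaf, which is the sense in which a hypothesis test doubles as a stopping condition for tree growth.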