Random forests are some of the most widely used machine learning models today, especially in domains that require interpretability. We present an algorithm that accelerates the training of random forests and other popular tree-based learning methods. At the core of our algorithm is a novel node-splitting subroutine, dubbed MABSplit, that efficiently finds split points when constructing decision trees. Our algorithm borrows techniques from the multi-armed bandit literature to judiciously determine how to allocate samples and computational power across candidate split points. We provide theoretical guarantees that MABSplit improves the sample complexity of each node split from linear to logarithmic in the number of data points. In some settings, MABSplit leads to 100x faster training (a 99% reduction in training time) without any decrease in generalization performance. We demonstrate similar speedups when MABSplit is used across a variety of forest-based variants, such as Extremely Random Forests and Random Patches. We also show that our algorithm can be used for both classification and regression tasks. Finally, we show that MABSplit outperforms existing methods in generalization performance and feature importance calculations under a fixed computational budget. All of our experimental results are reproducible via a one-line script at https://github.com/ThrunGroup/FastForest.
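To make the bandit framing concrete, the following is a minimal Python sketch of how a successive-elimination strategy over candidate split points might look: each (feature, threshold) pair is treated as an arm, impurity is estimated from growing subsamples, and arms whose confidence interval is dominated by the best arm's are eliminated early, so most arms see far fewer than n data points. All names (bandit_split, gini_impurity_reduction) and the Hoeffding-style confidence constants here are illustrative assumptions, not the authors' actual MABSplit implementation; see the linked repository for that.

```python
import numpy as np

def gini_impurity_reduction(y_left, y_right):
    """Weighted Gini impurity of a candidate split (lower is better).

    Labels are assumed to be non-negative integer class indices.
    """
    def gini(y):
        if len(y) == 0:
            return 0.0
        p = np.bincount(y) / len(y)
        return 1.0 - np.sum(p ** 2)

    n = len(y_left) + len(y_right)
    return (len(y_left) * gini(y_left) + len(y_right) * gini(y_right)) / n

def bandit_split(X, y, thresholds, batch_size=100, delta=0.01):
    """Illustrative successive-elimination sketch, not the paper's algorithm.

    thresholds is a list of (feature_index, threshold_value) arms. Arms are
    sampled in shared batches; an arm is eliminated once its lower confidence
    bound exceeds the best arm's upper bound, so suboptimal arms are
    evaluated on only a small fraction of the data.
    """
    n = len(y)
    arms = list(range(len(thresholds)))
    estimates = np.zeros(len(thresholds))   # running mean impurity per arm
    counts = np.zeros(len(thresholds), dtype=int)
    rng = np.random.default_rng(0)

    while len(arms) > 1 and counts[arms[0]] < n:
        idx = rng.choice(n, size=min(batch_size, n), replace=False)
        for a in arms:
            f, t = thresholds[a]
            left = y[idx][X[idx, f] <= t]
            right = y[idx][X[idx, f] > t]
            impurity = gini_impurity_reduction(left, right)
            counts[a] += len(idx)
            # incremental update of the running mean impurity estimate
            estimates[a] += (impurity - estimates[a]) * len(idx) / counts[a]
        # Hoeffding-style confidence width; constants are illustrative only
        width = np.sqrt(np.log(2 * len(thresholds) / delta) / (2 * counts[arms]))
        lower = estimates[arms] - width
        upper = estimates[arms] + width
        best_upper = upper.min()
        arms = [a for a, lo in zip(arms, lower) if lo <= best_upper]

    return min(arms, key=lambda a: estimates[a])
```

In a full tree-training loop, a routine like this would replace the exact linear-time impurity scan at each node; MABSplit's logarithmic sample-complexity guarantee depends on carefully constructed confidence intervals, which the fixed-width bound above only approximates.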