Decision tree optimization is notoriously difficult from a computational perspective but essential for the field of interpretable machine learning. Despite efforts over the past 40 years, only recently have optimization breakthroughs been made that allow practical algorithms to find optimal decision trees. These new techniques have the potential to trigger a paradigm shift in which sparse decision trees can be constructed to efficiently optimize a variety of objective functions, without relying on greedy splitting and pruning heuristics that often lead to suboptimal solutions. The contribution of this work is a general framework for decision tree optimization that addresses two significant open problems in the area: the treatment of imbalanced data and full optimization over continuous variables. We present techniques that produce optimal decision trees for a variety of objectives, including F-score, AUC, and partial area under the ROC convex hull. We also introduce a scalable algorithm that produces provably optimal results in the presence of continuous variables and speeds up decision tree construction by several orders of magnitude relative to the state of the art.