Decision trees are among the most useful and popular methods in the machine learning toolbox. In this paper, we consider the problem of learning optimal decision trees, a combinatorial optimization problem that is challenging to solve at scale. A common approach in the literature is to use greedy heuristics, which may not yield optimal trees. Recently, there has been significant interest in learning optimal decision trees via various approaches (e.g., based on integer programming or dynamic programming); to achieve computational scalability, most of these approaches focus on classification tasks with binary features. In this paper, we present a new discrete optimization method based on branch-and-bound (BnB) to obtain optimal decision trees. Unlike existing customized approaches, we consider both regression and classification tasks with continuous features. The basic idea underlying our approach is to split the search space based on quantiles of the feature distribution, yielding upper and lower bounds for the underlying optimization problem along the BnB iterations. Our proposed algorithm, Quant-BnB, shows significant speedups over existing approaches for shallow optimal trees on various real datasets.
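To convey the quantile-branching principle at a high level, the following is a minimal sketch, not the authors' implementation, of a quantile-based branch-and-bound search for a depth-1 regression stump on a single continuous feature. All function and variable names (e.g., quant_bnb_stump, interval_lower_bound) are illustrative assumptions, and the bounds used here are deliberately simple stand-ins for the paper's refined bounds.

```python
# A hedged sketch of quantile-based branch-and-bound for one feature.
# Assumption: squared-error loss with a constant prediction per leaf.
import numpy as np

def sse(y):
    """Sum of squared errors of y around its mean (0 for empty arrays)."""
    return 0.0 if len(y) == 0 else float(np.sum((y - y.mean()) ** 2))

def stump_loss(x, y, t):
    """Exact loss of a stump sending x <= t left and x > t right."""
    left = x <= t
    return sse(y[left]) + sse(y[~left])

def interval_lower_bound(x, y, a, b):
    """Valid lower bound on the loss of any threshold t in (a, b]:
    points with x <= a are surely left, points with x > b surely right;
    each certain group's SSE around its own mean lower-bounds its leaf's
    contribution, and uncertain points in (a, b] contribute at least 0."""
    return sse(y[x <= a]) + sse(y[x > b])

def quant_bnb_stump(x, y, n_quantiles=8, tol=1e-4):
    """Branch on quantile intervals of x, pruning any interval whose
    lower bound already exceeds the best exact loss found so far."""
    qs = np.quantile(x, np.linspace(0, 1, n_quantiles + 1))
    intervals = [(qs[i], qs[i + 1]) for i in range(n_quantiles) if qs[i] < qs[i + 1]]
    best_t, best_loss = None, np.inf
    while intervals:
        a, b = intervals.pop(0)
        if interval_lower_bound(x, y, a, b) >= best_loss - tol:
            continue  # no threshold in (a, b] can beat the incumbent
        mid = 0.5 * (a + b)
        loss = stump_loss(x, y, mid)  # upper bound from a concrete threshold
        if loss < best_loss:
            best_t, best_loss = mid, loss
        if b - a > tol:  # refine: split the interval and search both halves
            intervals += [(a, mid), (mid, b)]
    return best_t, best_loss

# Illustrative usage on synthetic data: the recovered threshold should
# land near the true split at 0.3.
rng = np.random.default_rng(0)
x = rng.uniform(size=200)
y = (x > 0.3).astype(float) + 0.1 * rng.normal(size=200)
t, loss = quant_bnb_stump(x, y)
```

The actual Quant-BnB algorithm searches over many features and deeper (but still shallow) trees with substantially tighter bounds; this sketch only illustrates how branching on quantile intervals yields upper and lower bounds that let BnB prune regions of the threshold space.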