Regression trees are one of the oldest forms of AI models, and their predictions can be made without a calculator, which makes them broadly useful, particularly for high-stakes applications. Within the large literature on regression trees, there has been little effort towards full provable optimization, mainly due to the computational hardness of the problem. This work proposes a dynamic-programming-with-bounds approach to the construction of provably-optimal sparse regression trees. We leverage a novel lower bound based on an optimal solution to the k-Means clustering problem in one dimension over the set of labels. We are often able to find optimal sparse trees in seconds, even for challenging datasets that involve large numbers of samples and highly-correlated features.
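To make the lower bound concrete, the sketch below computes the optimal one-dimensional k-Means objective over a set of labels by dynamic programming; it is a minimal illustration under our own assumptions (the function name `kmeans_1d_cost` and the O(k·n²) DP are not taken from the paper). Since any regression tree with at most k leaves predicts a single constant per leaf, its sum of squared errors can be no smaller than this unconstrained clustering cost, which is what makes the quantity usable as a lower bound during the search.

```python
import numpy as np

def kmeans_1d_cost(labels, k):
    """Minimum within-cluster sum of squares for grouping 1-D labels into
    at most k clusters. Optimal 1-D clusters are contiguous in sorted order,
    so a simple O(k * n^2) dynamic program suffices for illustration."""
    y = np.sort(np.asarray(labels, dtype=float))
    n = len(y)
    # Prefix sums let us evaluate the SSE of any contiguous segment in O(1).
    ps = np.concatenate(([0.0], np.cumsum(y)))
    ps2 = np.concatenate(([0.0], np.cumsum(y * y)))

    def seg_sse(i, j):
        # Sum of squared errors of y[i:j] around its mean (j exclusive).
        s, s2, m = ps[j] - ps[i], ps2[j] - ps2[i], j - i
        return s2 - s * s / m

    # dp[c][j] = best cost of splitting the first j sorted labels into c clusters.
    INF = float("inf")
    dp = [[INF] * (n + 1) for _ in range(k + 1)]
    dp[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(1, n + 1):
            dp[c][j] = min(dp[c - 1][i] + seg_sse(i, j) for i in range(c - 1, j))
    # "At most k" clusters: take the best over cluster counts 1..k.
    return min(dp[c][n] for c in range(1, k + 1))
```

For example, `kmeans_1d_cost(y, 3)` on the label vector of a candidate subproblem gives a value that no 3-leaf subtree over those samples can beat, so a branch whose bound already exceeds the best known objective can be pruned without enumeration.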