Decision trees and random forests remain highly competitive for classification on medium-sized, standard datasets due to their robustness, minimal preprocessing requirements, and interpretability. However, a single tree suffers from high estimation variance, while large ensembles reduce this variance at the cost of substantial computational overhead and diminished interpretability. In this paper, we propose Decision Tree Embedding (DTE), a fast and effective method that leverages the leaf partitions of a trained classification tree to construct an interpretable feature representation. By using the sample means within each leaf region as anchor points, DTE maps inputs into an embedding space defined by the tree's partition structure, effectively circumventing the high variance inherent in decision-tree splitting rules. We further introduce an ensemble extension based on additional bootstrap trees, and pair the resulting embedding with linear discriminant analysis for classification. We establish several population-level theoretical properties of DTE, including its preservation of the conditional density under mild conditions and a characterization of the resulting classification error. Empirical studies on synthetic and real datasets demonstrate that DTE strikes a strong balance between accuracy and computational efficiency, matching or outperforming random forests and shallow neural networks while, in most cases, requiring only a fraction of their training time. Overall, the proposed DTE method can be viewed either as a scalable decision tree classifier that improves upon standard split rules, or as a neural network whose weights are learned from tree-derived anchor points, achieving an intriguing integration of both paradigms.
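To make the core construction concrete, the following is a minimal sketch of the embedding step described above, not the paper's implementation. It assumes leaf assignments are available from a trained tree (here mimicked by a single axis-aligned split, standing in for something like `tree.apply(X)`), uses the leaf sample means as anchor points, and, as one plausible choice of coordinate map, embeds each input by its negative squared distances to the anchors; the function names are hypothetical.

```python
import numpy as np

def leaf_anchor_means(X, leaf_ids):
    # Anchor point for each leaf = mean of training samples in that leaf region.
    return np.array([X[leaf_ids == l].mean(axis=0) for l in np.unique(leaf_ids)])

def dte_embed(X, anchors):
    # Embedding coordinate k = -||x - anchor_k||^2 (one plausible choice of map;
    # the paper's exact coordinate functions may differ).
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(axis=-1)
    return -d2

# Toy data: two leaf regions, as a depth-1 tree splitting on feature 0 would give.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
leaf_ids = (X[:, 0] > 0).astype(int)   # stand-in for a fitted tree's leaf assignments
anchors = leaf_anchor_means(X, leaf_ids)
Z = dte_embed(X, anchors)
print(Z.shape)  # (200, 2): one embedding coordinate per leaf anchor
```

The embedded representation `Z` would then be passed to a linear classifier such as linear discriminant analysis; because the coordinates depend on leaf means rather than on the exact split thresholds, small perturbations of the splits move the anchors only slightly, which is the variance-reduction intuition behind DTE.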