Decision trees are widely used for non-linear modeling, as they capture interactions between predictors while producing inherently interpretable models. Despite their popularity, performing inference on the non-linear fit remains largely unaddressed. This paper focuses on classification trees and makes two key contributions. First, we introduce a novel tree-fitting method that replaces the greedy splitting of the predictor space in standard tree algorithms with a probabilistic approach. Each split is selected according to sampling probabilities defined by an exponential mechanism, with a temperature parameter controlling the deviation from the deterministic, data-driven choice. Second, although at high temperatures our approach fits a tree that, with high probability, approximates the fit produced by standard tree algorithms, it is not merely predictive: unlike standard algorithms, it enables valid inference by accounting for the highly adaptive tree structure. Our method produces pivots directly from the sampling probabilities in the exponential mechanism. In theory, these pivots allow asymptotically valid inference on the parameters of the predictive fit; in practice, our method delivers powerful inference without sacrificing predictive accuracy, in contrast to data-splitting methods.
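To make the split-sampling step concrete, one way to write such an exponential mechanism is sketched below; the gain function g, the candidate split set S, and the temperature convention are illustrative assumptions rather than the paper's exact definitions:

\[
\mathbb{P}\{\text{split } s \text{ is selected} \mid \text{data}\} \;=\; \frac{\exp\{\tau \, g(s)\}}{\sum_{s' \in \mathcal{S}} \exp\{\tau \, g(s')\}}, \qquad s \in \mathcal{S},
\]

where g(s) is a data-dependent measure of split quality (e.g., the reduction in impurity) and \(\tau > 0\) is the temperature. Under this parameterization, as \(\tau \to \infty\) the distribution concentrates on the greedy split chosen by standard algorithms, consistent with the high-temperature approximation claimed above, while small \(\tau\) spreads probability more evenly over the candidate splits.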