Decision trees are important both as interpretable models amenable to high-stakes decision-making, and as building blocks of ensemble methods such as random forests and gradient boosting. Their statistical properties, however, are not well understood. The most cited prior works have focused on deriving pointwise consistency guarantees for CART in a classical nonparametric regression setting. We take a different approach, and advocate studying the generalization performance of decision trees with respect to different generative regression models. This allows us to elicit their inductive bias, that is, the assumptions the algorithms make (or do not make) to generalize to new data, thereby guiding practitioners on when and how to apply these methods. In this paper, we focus on sparse additive generative models, which have both low statistical complexity and some nonparametric flexibility. We prove a sharp squared error generalization lower bound for a large class of decision tree algorithms fitted to sparse additive models with $C^1$ component functions. This bound is surprisingly much worse than the minimax rate for estimating such sparse additive models. The inefficiency is due not to greediness, but to the loss in power for detecting global structure when we average responses solely over each leaf, an observation that suggests opportunities to improve tree-based algorithms, for example, by hierarchical shrinkage. To prove these bounds, we develop new technical machinery, establishing a novel connection between decision tree estimation and rate-distortion theory, a sub-field of information theory.
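To make the two ingredients above concrete, the sketch below simulates data from a sparse additive model (only $s$ of $d$ coordinates are active), fits CART with scikit-learn, and applies one common form of hierarchical shrinkage: each node's prediction is replaced by the root mean plus the parent-to-child mean differences along its path, each damped by $1/(1 + \lambda/N(\text{parent}))$. This is an illustrative sketch, not the paper's experimental setup; the component functions $f_j(x) = \sin(\pi x)$, the sample sizes, and the value of $\lambda$ are all assumed here for demonstration.

```python
# Minimal sketch (illustrative assumptions, not the paper's setup):
# fit CART to data from a sparse additive model, then apply one form
# of hierarchical shrinkage. Requires numpy and scikit-learn.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Sparse additive model: y = sum_{j < s} f_j(x_j) + noise, with d >> s.
# The C^1 components f_j(x) = sin(pi * x) are an illustrative choice.
d, s, n = 50, 5, 2000
X = rng.uniform(-1.0, 1.0, size=(n, d))
f = lambda Z: sum(np.sin(np.pi * Z[:, j]) for j in range(s))
y = f(X) + 0.1 * rng.standard_normal(n)

tree = DecisionTreeRegressor(min_samples_leaf=5, random_state=0).fit(X, y)

def hs_node_values(fitted_tree, lam=30.0):
    """Per-node predictions after hierarchical shrinkage.

    Each node's prediction becomes the root mean plus the
    parent-to-child mean differences along its root-to-node path,
    each damped by 1 / (1 + lam / n_samples(parent)).
    """
    t = fitted_tree.tree_
    shrunk = np.empty(t.node_count)

    def recurse(node, cum, parent_mean, n_parent):
        mean = t.value[node, 0, 0]          # raw in-node sample mean
        if n_parent is None:                # root keeps its raw mean
            cum = mean
        else:
            cum = cum + (mean - parent_mean) / (1.0 + lam / n_parent)
        shrunk[node] = cum
        left, right = t.children_left[node], t.children_right[node]
        if left != -1:                      # internal node: recurse on children
            recurse(left, cum, mean, t.n_node_samples[node])
            recurse(right, cum, mean, t.n_node_samples[node])

    recurse(0, 0.0, 0.0, None)
    return shrunk

shrunk = hs_node_values(tree, lam=30.0)

# Compare squared-error generalization of raw CART vs. shrunken CART
# against the noiseless regression function.
X_test = rng.uniform(-1.0, 1.0, size=(5000, d))
y_test = f(X_test)
mse_raw = np.mean((tree.predict(X_test) - y_test) ** 2)
mse_hs = np.mean((shrunk[tree.apply(X_test)] - y_test) ** 2)
print(f"test MSE, raw CART: {mse_raw:.3f}  with shrinkage: {mse_hs:.3f}")
```

Because deeper nodes average over fewer samples, the $\lambda/N(\text{parent})$ penalty damps noisy leaf-level jumps far more aggressively than splits near the root, counteracting exactly the weakness the abstract attributes to predicting with pure per-leaf averages.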