Decision trees with binary splits are popularly constructed using Classification and Regression Trees (CART) methodology. For regression models, this approach recursively divides the data into two near-homogeneous daughter nodes according to a split point that maximizes the reduction in sum of squares error (the impurity) along a particular variable. This paper aims to study the statistical properties of regression trees constructed with CART methodology. In doing so, we find that the training error is governed by the Pearson correlation between the optimal decision stump and the response data in each node, which we bound by constructing a prior distribution on the split points and solving a nonlinear optimization problem. We leverage this connection between the training error and the Pearson correlation to show that CART with cost-complexity pruning achieves an optimal complexity/goodness-of-fit tradeoff when the depth scales with the logarithm of the sample size. Data-dependent quantities, which adapt to the dimensionality and latent structure of the regression model, are seen to govern the rates of convergence of the prediction error.
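As a rough illustration of the split criterion and the correlation identity described above, the following Python sketch (not from the paper; the function `best_split` and all variable names are hypothetical) finds the CART split point for a single variable and checks numerically that the normalized impurity reduction equals the squared Pearson correlation between the optimal decision stump and the response.

```python
# Minimal sketch, assuming a one-dimensional covariate x and response y.
# It exhaustively scans candidate split points, which is how CART selects
# the split for regression: maximize the reduction in sum of squares error.
import numpy as np

def best_split(x, y):
    """Return (split point, impurity reduction) maximizing the drop in
    sum of squares when y is fit by the child means on {x <= s}, {x > s}."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    n = len(ys)
    sse_parent = np.sum((ys - ys.mean()) ** 2)
    best_s, best_gain = None, -np.inf
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue  # no valid split between tied covariate values
        left, right = ys[:i], ys[i:]
        sse_children = np.sum((left - left.mean()) ** 2) \
                     + np.sum((right - right.mean()) ** 2)
        gain = sse_parent - sse_children  # reduction in impurity
        if gain > best_gain:
            best_s, best_gain = (xs[i - 1] + xs[i]) / 2, gain
    return best_s, best_gain, sse_parent

rng = np.random.default_rng(0)
x = rng.uniform(size=200)
y = np.where(x > 0.4, 1.0, 0.0) + 0.3 * rng.standard_normal(200)

s, gain, sse = best_split(x, y)
# The optimal decision stump: child means on each side of the split.
stump = np.where(x <= s, y[x <= s].mean(), y[x > s].mean())
rho = np.corrcoef(stump, y)[0, 1]
# Identity: gain / parent impurity == squared Pearson correlation.
print(f"split={s:.3f}, gain/SSE={gain / sse:.4f}, rho^2={rho ** 2:.4f}")
```

The two printed quantities agree because the stump is the least squares fit of y onto the split indicator (with intercept), so its R², which equals the squared Pearson correlation with y, is exactly the fraction of the node's impurity removed by the split.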