Decision trees are widely used for their low computational cost, good predictive performance, and ability to assess the importance of features. Though these methods are often used in practice for feature selection, their theoretical guarantees are not well understood. Here we obtain a tight finite-sample bound for the feature selection problem in linear regression using single-depth decision trees. We examine the statistical properties of these "decision stumps" for the recovery of the $s$ active features from $p$ total features, where $s \ll p$. Our analysis provides tight sample performance guarantees for high-dimensional sparse systems that match the finite-sample bound of $O(s \log p)$ achieved by the Lasso, improving upon previous bounds for both the median and optimal splitting criteria. Our results extend to the non-linear regime as well as to arbitrary sub-Gaussian distributions, demonstrating that tree-based methods attain strong feature selection properties under a wide variety of settings and further shedding light on the success of these methods in practice. As a byproduct of our analysis, we show that recovery can be provably guaranteed even when the number of active features $s$ is unknown. We further validate our theoretical results and proof methodology using computational experiments.
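To make the procedure concrete, below is a minimal Python sketch of decision-stump feature selection under the optimal splitting criterion: each feature is scored by the largest reduction in squared error achievable with a single split, and the top-$s$ features by score are returned. The function names (`stump_impurity_reduction`, `select_features`) and all parameter choices in the toy check are hypothetical illustrations of the general technique, not the authors' implementation.

```python
import numpy as np

def stump_impurity_reduction(x, y):
    """Largest drop in sum of squared errors achievable by a single split on x."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    n = len(y)
    total_sse = np.sum((y - y.mean()) ** 2)
    csum = np.cumsum(y_sorted)            # prefix sums of y
    csum_sq = np.cumsum(y_sorted ** 2)    # prefix sums of y^2
    best = 0.0
    for i in range(1, n):                 # left block = first i sorted points
        if x_sorted[i] == x_sorted[i - 1]:
            continue                      # no valid threshold between tied values
        left_sse = csum_sq[i - 1] - csum[i - 1] ** 2 / i
        right_sse = (csum_sq[-1] - csum_sq[i - 1]) \
            - (csum[-1] - csum[i - 1]) ** 2 / (n - i)
        best = max(best, total_sse - left_sse - right_sse)
    return best

def select_features(X, y, s):
    """Score each feature with its best stump split; return the top-s indices."""
    scores = [stump_impurity_reduction(X[:, j], y) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1][:s]

# Toy check on a sparse linear model: s active features out of p.
rng = np.random.default_rng(0)
n, p, s = 500, 200, 5
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:s] = 2.0
y = X @ beta + rng.standard_normal(n)
print(sorted(select_features(X, y, s)))  # ideally recovers [0, 1, 2, 3, 4]
```

The median splitting criterion also analyzed in the abstract would replace the search over all thresholds with a single fixed split at each feature's empirical median, trading some score resolution for a simpler statistic.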