Decision trees and their ensembles come with a rich set of diagnostic tools for ranking and screening variables in a predictive model. Despite the widespread use of tree-based variable importance measures, their theoretical properties have proved difficult to pin down and remain largely unexplored. To address this gap between theory and practice, we derive finite-sample performance guarantees for variable selection in nonparametric models using a single-level CART decision tree (a decision stump). Under standard operating assumptions in the variable screening literature, we find that the marginal signal strength of each variable can be considerably weaker, and the ambient dimensionality considerably higher, than state-of-the-art nonparametric variable selection methods require. Furthermore, unlike previous marginal screening methods that attempt to directly estimate each marginal projection via a truncated basis expansion, the fitted model used here is a simple, parsimonious decision stump, which eliminates the need to tune the number of basis terms. Thus, surprisingly, even though decision stumps are highly inaccurate for estimation purposes, they can still be used to perform consistent model selection.
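As an illustration of screening with decision stumps, the following is a minimal sketch, not the authors' exact procedure. It assumes a regression setting with squared-error impurity: each variable is scored by the largest impurity decrease achievable with a single split on that variable, and the variables with the highest scores are retained. The function names `stump_impurity_reduction` and `screen_variables`, the top-`k` selection rule, and the simulated data are all hypothetical choices made for this example.

```python
import numpy as np


def stump_impurity_reduction(x, y):
    """Largest decrease in squared-error impurity from one split of y on x."""
    n = len(y)
    order = np.argsort(x, kind="stable")
    xs, ys = x[order], y[order]

    # Prefix sums give the left-child sum of y for every candidate split.
    left_sum = np.cumsum(ys)[:-1]
    n_left = np.arange(1, n)
    n_right = n - n_left
    left_mean = left_sum / n_left
    right_mean = (ys.sum() - left_sum) / n_right

    # Impurity decrease of a split equals
    # (n_L * n_R / n^2) * (mean_L - mean_R)^2 for squared-error impurity.
    gain = (n_left * n_right / n**2) * (left_mean - right_mean) ** 2

    # A split is only possible between two distinct consecutive x values.
    valid = xs[1:] > xs[:-1]
    if not valid.any():
        return 0.0
    return float(gain[valid].max())


def screen_variables(X, y, k):
    """Score every column of X with a stump and return the k best indices.

    The top-k rule is an assumption of this sketch; a data-driven
    threshold on the scores could be used instead.
    """
    scores = np.array(
        [stump_impurity_reduction(X[:, j], y) for j in range(X.shape[1])]
    )
    selected = np.argsort(-scores)[:k]
    return scores, selected


# Simulated example (hypothetical data): only the first two of 200
# variables influence the response.
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 200))
y = 3.0 * X[:, 0] + 2.0 * (X[:, 1] > 0.5) + 0.1 * rng.normal(size=500)

scores, selected = screen_variables(X, y, k=5)
print("selected variables:", selected)
```

Each variable is fit with a single split only, so the only quantity to choose is the selection rule applied to the scores; no basis size has to be tuned for each variable.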