Variable selection is an important statistical problem. It becomes more challenging when the candidate predictors are of mixed type (e.g., continuous and binary) and affect the response variable in nonlinear and/or non-additive ways. In this paper, we review existing variable selection approaches for the Bayesian additive regression trees (BART) model, a nonparametric regression model flexible enough to capture both interactions between predictors and nonlinear relationships with the response. The review emphasizes each approach's ability to identify relevant predictors. We also propose two variable importance measures that can be used in a permutation-based variable selection approach, as well as a backward variable selection procedure for BART. We present simulations demonstrating that, compared to existing BART-based variable selection methods, our approaches are better able to recover all the relevant predictors in a variety of data settings.