As a flexible nonparametric learning tool, the random forests algorithm has been widely applied to a range of real applications with appealing empirical performance, even in the presence of a high-dimensional feature space. Efforts to unveil the underlying mechanisms have led to some important recent theoretical results on the consistency of the random forests algorithm and its variants. However, to our knowledge, almost all existing works concerning random forests consistency in high-dimensional settings were established for various modified random forests models in which the splitting rules are independent of the response; a few exceptions assume simple data-generating models with binary features. In light of this, in this paper we derive consistency rates for the random forests algorithm associated with the sample CART splitting criterion, the criterion used in the original version of the algorithm, in a general high-dimensional nonparametric regression setting through a bias-variance decomposition analysis. Our new theoretical results show that random forests can indeed adapt to high dimensionality and allow for discontinuous regression functions. Our bias analysis characterizes explicitly how the random forests bias depends on the sample size, tree height, and column subsampling parameter. Some limitations of our current results are also discussed.
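For reference, the sample CART splitting criterion mentioned above is the standard variance-reduction rule for regression trees; the sketch below uses generic notation of our own choosing rather than the paper's, and is intended only as a reminder of why such splits depend on the response. At a node $t$ containing observations $\{(x_i, y_i) : x_i \in t\}$, the split chooses a candidate feature $j$ (from the column-subsampled set) and a cutpoint $c$ maximizing the decrease in within-node sample variance,
\[
\widehat{\Delta}(j, c; t) \;=\; \frac{1}{\# t}\sum_{x_i \in t}\bigl(y_i - \bar{y}_t\bigr)^2
\;-\; \frac{1}{\# t}\sum_{x_i \in t_L}\bigl(y_i - \bar{y}_{t_L}\bigr)^2
\;-\; \frac{1}{\# t}\sum_{x_i \in t_R}\bigl(y_i - \bar{y}_{t_R}\bigr)^2,
\]
where $t_L = \{x \in t : x_j \le c\}$, $t_R = \{x \in t : x_j > c\}$, $\# t$ denotes the number of sample points in $t$, and $\bar{y}_t$, $\bar{y}_{t_L}$, $\bar{y}_{t_R}$ are the corresponding within-node response means. Because the chosen $(j, c)$ depends on the $y_i$'s, this splitting rule is response-dependent, in contrast with the modified models discussed above.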