As a flexible nonparametric learning tool, random forests have been widely applied in practice with appealing empirical performance, even in the presence of a high-dimensional feature space. Efforts to unveil the underlying mechanisms have led to some important recent theoretical results on the consistency of the random forests algorithm and its variants. However, to our knowledge, all existing works concerning random forests consistency in the high-dimensional setting were done for various modified random forests models in which the splitting rules are independent of the response. In light of this, in this paper we derive consistency rates for the random forests algorithm associated with the sample CART splitting criterion, which is the one used in the original version of the algorithm in Breiman (2001), in a general high-dimensional nonparametric regression setting through a bias-variance decomposition analysis. Our new theoretical results show that random forests can indeed adapt to high dimensionality and allow for discontinuous regression functions. Our bias analysis characterizes explicitly how the random forests bias depends on the sample size, tree height, and column subsampling parameter. Some limitations of our current results are also discussed.
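For readers unfamiliar with the sample CART splitting criterion mentioned in the abstract, the following is a minimal sketch of one node split in a regression tree: among a random subset of features, choose the feature and threshold that maximize the decrease in the sum of squared errors of the responses. The function names (`sse`, `best_cart_split`, `subsample_features`), the midpoint thresholds, and the symbol `gamma` for the column subsampling fraction are illustrative assumptions, not the notation or implementation of the paper.

```python
import math
import random


def sse(ys):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_cart_split(X, y, candidate_features):
    """Return (feature, threshold, decrease) maximizing the SSE decrease.

    The split depends on the responses y, unlike the response-independent
    splitting rules of modified forests. Returns None if no split exists.
    """
    parent = sse(y)
    best = None
    for j in candidate_features:
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values
        # (an assumption of this sketch; other conventions exist).
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            decrease = parent - sse(left) - sse(right)
            if best is None or decrease > best[2]:
                best = (j, t, decrease)
    return best


def subsample_features(p, gamma, rng):
    """Hypothetical helper: draw ceil(gamma * p) of p features without replacement."""
    k = max(1, math.ceil(gamma * p))
    return rng.sample(range(p), k)


# Toy data: the response is a step function of feature 0 (a discontinuous
# regression function); feature 1 is uninformative.
X = [[0.1, 0.9], [0.2, 0.1], [0.8, 0.5], [0.9, 0.3]]
y = [0.0, 0.0, 1.0, 1.0]

rng = random.Random(0)
features = subsample_features(p=2, gamma=1.0, rng=rng)  # gamma = 1 keeps all columns
print(best_cart_split(X, y, features))
```

Growing a tree repeats this split recursively up to a chosen height, and a forest averages many such trees built on resampled data. A smaller `gamma` exposes fewer columns at each split, which is the column subsampling parameter the bias analysis refers to.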