On Variance Estimation of Random Forests

Ensemble methods, such as random forests, are popular in applications due to their high predictive accuracy. Existing literature views a random forest prediction as an infinite-order incomplete U-statistic to quantify its uncertainty. However, these methods focus on a small subsample size for each tree, which is theoretically valid but limiting in practice. This paper develops an unbiased variance estimator based on incomplete U-statistics, which allows the tree size to be comparable with the overall sample size, making statistical inference possible in a broader range of real applications. Simulation results demonstrate that our estimators achieve lower bias and more accurate coverage rates without additional computational costs. We also propose a local smoothing procedure to reduce the variation of our estimator, which shows improved numerical performance when the number of trees is relatively small. Further, we investigate the ratio consistency of our proposed variance estimator under specific scenarios. In particular, we develop a new "double U-statistic" formulation to analyze the Hoeffding decomposition of the estimator's variance.
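To make the incomplete U-statistic view concrete, the following is a minimal sketch under stated assumptions, not the variance estimator proposed in the paper. It averages a symmetric kernel over size-k subsamples drawn without replacement, either over all subsets (complete) or over a random selection of them (incomplete). The function names and the use of the subsample mean as a stand-in for a single tree's prediction are hypothetical choices made only for illustration.

```python
import random
from itertools import combinations
from statistics import fmean


def complete_u_statistic(data, k, kernel):
    """Average the kernel over all C(n, k) size-k subsets of the data."""
    return fmean([kernel(subset) for subset in combinations(data, k)])


def incomplete_u_statistic(data, k, kernel, n_subsamples, rng):
    """Average the kernel over n_subsamples random size-k subsamples.

    Each subsample is drawn without replacement, in analogy to the
    subsample used to build one tree of a subsampled ensemble.
    """
    return fmean([kernel(rng.sample(data, k)) for _ in range(n_subsamples)])


# Assumption: the subsample mean stands in for a tree's prediction.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(12)]
k = 4

full = complete_u_statistic(data, k, fmean)
approx = incomplete_u_statistic(data, k, fmean, n_subsamples=200, rng=rng)

print(f"complete U-statistic:   {full:.4f}")
print(f"incomplete U-statistic: {approx:.4f}")
```

With the mean kernel, the complete U-statistic coincides with the overall sample mean, and the incomplete version approximates it using far fewer kernel evaluations. In a forest, the kernel would instead be a tree fitted on the subsample and evaluated at a target point, and k would play the role of the tree size discussed above.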
