Random forests have been among the most widely used machine learning methods of the past decade thanks to their outstanding empirical performance. Yet, because of their black-box nature, the results of random forests can be hard to interpret in many big data applications. Quantifying the usefulness of individual features in random forests learning can greatly enhance interpretability. Existing studies have shown that some popular feature importance measures for random forests suffer from bias. In addition, comprehensive size and power analyses are lacking for most of these existing methods. In this paper, we approach the problem via hypothesis testing and suggest a framework, the self-normalized feature-residual correlation test (FACT), for evaluating the significance of a given feature in the random forests model with a bias-resistance property, where the null hypothesis is that the feature is conditionally independent of the response given all other features. Such an endeavor on random forests inference is empowered by recent developments on high-dimensional random forests consistency. The vanilla version of the FACT test can suffer from bias in the presence of feature dependency. We exploit the techniques of imbalancing and conditioning for bias correction. We further incorporate the ensemble idea into the FACT statistic through feature transformations for enhanced power. Under a fairly general high-dimensional nonparametric model setting with dependent features, we formally establish through nonasymptotic analyses that FACT provides theoretically justified random forests feature p-values and enjoys appealing power. The theoretical results and finite-sample advantages of the newly suggested method are illustrated with several simulation examples and an economic forecasting application in relation to COVID-19.