In many studies, we want to determine the influence of certain features on a dependent variable. More specifically, we are interested in the strength of the influence (i.e., is the feature relevant?) and, if so, in how the feature influences the dependent variable. Recently, data-driven approaches such as \emph{random forest regression} have found their way into applications (Boulesteix et al., 2012). These models make it possible to derive measures of feature importance directly, and these measures are a natural indicator of the strength of the influence. For the relevant features, the correlation or rank correlation between the feature and the dependent variable has typically been used to determine the nature of the influence. More recent methods, some of which can also measure interactions between features, are based on a modeling approach. In particular, when machine learning models are used, SHAP scores are a recent and prominent method for determining these trends (Lundberg et al., 2017). In this paper, we introduce a novel notion of feature importance based on the well-studied Gram-Schmidt decorrelation method. Furthermore, we propose two estimators for identifying trends in the data using random forest regression, the so-called absolute and relative transversal rate. We empirically compare the properties of our estimators with those of well-established estimators on a variety of synthetic and real-world datasets.
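As a rough illustration of the Gram-Schmidt decorrelation step mentioned above, the following sketch orthogonalizes centered feature columns so that each column is uncorrelated with the ones before it. It is a generic textbook construction under our own assumptions, not the importance measure or the transversal-rate estimators proposed in the paper; the function name `gram_schmidt_decorrelate` and the synthetic data are hypothetical.

```python
import numpy as np

def gram_schmidt_decorrelate(X):
    """Return centered columns made mutually uncorrelated, column by column.

    Hypothetical helper for illustration only; not the paper's estimator.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)  # after centering, orthogonal columns are uncorrelated
    Q = np.zeros_like(Xc)
    for j in range(Xc.shape[1]):
        v = Xc[:, j].copy()
        for k in range(j):
            denom = Q[:, k] @ Q[:, k]
            if denom > 1e-12:  # skip (near-)zero columns to avoid division by zero
                # remove the component of v that lies along the earlier column
                v -= (Q[:, k] @ v) / denom * Q[:, k]
        Q[:, j] = v
    return Q

# Synthetic data (an assumption for this sketch): x2 depends strongly on x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=500)
X = np.column_stack([x1, x2])

Q = gram_schmidt_decorrelate(X)
corr_before = np.corrcoef(X, rowvar=False)[0, 1]
corr_after = np.corrcoef(Q, rowvar=False)[0, 1]
print(f"correlation before: {corr_before:.3f}, after: {corr_after:.3e}")
```

The result depends on the column order: the first column is kept as is (up to centering), and each later column retains only the part not linearly explained by its predecessors.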