Random forest is a popular machine learning approach for the analysis of high-dimensional data because it is flexible and provides variable importance measures for the selection of relevant features. However, the complex relationships between features are usually not considered in the selection and are thus also neglected in the characterization of the analysed samples. Here we propose two novel approaches that focus on the mutual impact of features in random forests. Mutual forest impact (MFI) is a relation parameter that evaluates the mutual association of features with the outcome and hence goes beyond the analysis of correlation coefficients. Mutual impurity reduction (MIR) is an importance measure that combines this relation parameter with the importance of the individual features. MFI and MIR are implemented together with testing procedures that generate p-values for the selection of related and important features. Applications to various simulated data sets and comparisons with other methods for feature selection and relation analysis show that MFI and MIR are very promising tools for shedding light on the complex relationships between features and outcome. In addition, they are not affected by common biases, e.g. the preference for features with many possible splits or high minor allele frequencies.