Random forests are considered one of the best out-of-the-box classification and regression algorithms due to their high level of predictive performance with relatively little tuning. Pairwise proximities can be computed from a trained random forest and measure the similarity between data points relative to the supervised task. Random forest proximities have been used in many applications including the identification of variable importance, data imputation, outlier detection, and data visualization. However, existing definitions of random forest proximities do not accurately reflect the data geometry learned by the random forest. In this paper, we introduce a novel definition of random forest proximities called Random Forest-Geometry- and Accuracy-Preserving proximities (RF-GAP). We prove that the proximity-weighted sum (regression) or majority vote (classification) using RF-GAP exactly matches the out-of-bag random forest prediction, thus capturing the data geometry learned by the random forest. We empirically show that this improved geometric representation outperforms traditional random forest proximities in tasks such as data imputation and provides outlier detection and visualization results consistent with the learned data geometry.
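The accuracy-preserving property claimed above can be checked numerically for regression. The sketch below is not the authors' implementation, and the abstract does not state the RF-GAP formula. It assumes one construction with the stated property: for each tree in which observation i is out-of-bag, every in-bag observation j that shares i's leaf receives weight equal to its bootstrap multiplicity divided by the total in-bag multiplicity of that leaf, and these weights are averaged over the trees in which i is out-of-bag. The function name `rf_gap_sketch`, the hand-rolled bagging loop, the tree hyperparameters, and the synthetic data are all illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def rf_gap_sketch(X, y, n_trees=50, seed=0):
    """Return (proximities, oob_predictions) for a small hand-rolled forest.

    Assumed construction (not taken from the abstract): row i of the
    proximity matrix averages, over the trees where i is out-of-bag, the
    in-bag multiplicity of each leaf-mate j divided by the leaf's total
    in-bag multiplicity.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    prox = np.zeros((n, n))
    oob_sum = np.zeros(n)
    oob_count = np.zeros(n)

    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)        # bootstrap sample with replacement
        counts = np.bincount(idx, minlength=n)  # in-bag multiplicity of each point
        tree = DecisionTreeRegressor(
            max_features=0.5,
            min_samples_leaf=3,
            random_state=int(rng.integers(2**31 - 1)),
        )
        tree.fit(X[idx], y[idx])

        leaves = tree.apply(X)                  # leaf id of every observation
        preds = tree.predict(X)
        # Total in-bag multiplicity falling in each leaf of this tree.
        leaf_mass = np.bincount(
            leaves, weights=counts, minlength=tree.tree_.node_count
        )

        oob = counts == 0                       # points this tree never saw
        same_leaf = leaves[oob][:, None] == leaves[None, :]
        prox[oob] += same_leaf * counts[None, :] / leaf_mass[leaves[oob]][:, None]
        oob_sum[oob] += preds[oob]
        oob_count[oob] += 1

    seen = oob_count > 0                        # points that were OOB at least once
    prox[seen] /= oob_count[seen][:, None]
    oob_pred = np.full(n, np.nan)
    oob_pred[seen] = oob_sum[seen] / oob_count[seen]
    return prox, oob_pred


# Synthetic regression data, for illustration only.
rng_demo = np.random.default_rng(0)
X_demo = rng_demo.normal(size=(120, 4))
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1] ** 2 + 0.1 * rng_demo.normal(size=120)

prox_demo, oob_demo = rf_gap_sketch(X_demo, y_demo)
valid = ~np.isnan(oob_demo)
gap = np.max(np.abs(prox_demo[valid] @ y_demo - oob_demo[valid]))
print("max |proximity-weighted sum - OOB prediction|:", gap)
```

Under this construction each proximity row sums to one and a point has zero proximity to itself, so the proximity-weighted sum of the training responses is a weighted average over the other training points. The bagging loop is written by hand only so that the bootstrap multiplicities are explicitly available; a library forest that exposes its in-bag indices could be used instead.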