Random forests are considered one of the best out-of-the-box classification and regression algorithms because they achieve high predictive performance with relatively little tuning. Pairwise proximities, which measure the similarity between data points relative to the supervised task, can be computed from a trained random forest. Random forest proximities have been used in many applications, including the identification of variable importance, data imputation, outlier detection, and data visualization. However, existing definitions of random forest proximities do not accurately reflect the data geometry learned by the random forest. In this paper, we introduce a novel definition of random forest proximities called Random Forest-Geometry- and Accuracy-Preserving proximities (RF-GAP). We prove that the proximity-weighted sum (regression) or majority vote (classification) using RF-GAP exactly matches the out-of-bag random forest prediction, thus capturing the data geometry learned by the random forest. We empirically show that this improved geometric representation outperforms traditional random forest proximities in tasks such as data imputation and provides outlier detection and visualization results consistent with the learned data geometry.
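
For readers unfamiliar with how pairwise proximities are obtained from a trained forest, the following is a minimal sketch of the traditional definition only: the fraction of trees in which two points fall in the same terminal node. It is not RF-GAP, whose definition is not given in this abstract. The use of scikit-learn, the iris data, and the helper name `classic_rf_proximity` are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier


def classic_rf_proximity(forest, X):
    """Traditional proximity: share of trees in which two rows of X land in the same leaf.

    Hypothetical helper for illustration; this is NOT the RF-GAP definition.
    """
    # apply() returns, for every sample, the index of the leaf it reaches in each tree.
    leaves = forest.apply(X)  # shape: (n_samples, n_trees)
    n_samples, n_trees = leaves.shape
    prox = np.zeros((n_samples, n_samples))
    for t in range(n_trees):
        col = leaves[:, t]
        # Boolean co-occurrence matrix for tree t, accumulated over all trees.
        prox += col[:, None] == col[None, :]
    return prox / n_trees


# Example data and hyperparameters are arbitrary choices for the sketch.
X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
P = classic_rf_proximity(forest, X)
```

The resulting matrix `P` is symmetric with ones on the diagonal, and can be passed to downstream tasks such as embedding or outlier scoring. Note that this version counts every tree for every pair, regardless of whether a point was in-bag or out-of-bag for that tree.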