Examples are given of data-generating models under which some versions of the random forest algorithm may fail to be consistent or may converge extremely slowly to the optimal predictor. The evidence for these properties rests on mostly intuitive arguments, similar to those used earlier with simpler examples, and on numerical experiments. Although one can always choose a model under which random forests perform very badly, it is shown that, when substantial improvement is possible, simple methods based on statistics of 'variable use' and 'variable importance' may indicate a better predictor based on a sort of mixture of random forests. Thus, by acknowledging the difficulties posed by some models, one may improve the performance of random forests in some applications.
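The abstract does not define the 'variable use' and 'variable importance' statistics, so the following is only a minimal sketch under stated assumptions: 'variable use' is taken to be the number of times a variable is chosen for a split, and 'variable importance' the total reduction in squared error from those splits. The ensemble consists of depth-one trees on bootstrap samples with a random subset of candidate variables per tree, which is a simplification of a random forest and not the procedure of the paper. The function names, the toy data, and the parameters `n_trees`, `mtry`, and `seed` are all hypothetical.

```python
import random


def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y, features):
    """Return (feature, threshold, gain) for the best single split, or (None, None, 0.0)."""
    parent = sse(y)
    best = (None, None, 0.0)
    for j in features:
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            gain = parent - sse(left) - sse(right)
            if gain > best[2]:
                best = (j, t, gain)
    return best


def variable_statistics(X, y, n_trees=100, mtry=2, seed=0):
    """Assumed definitions: use = split counts, importance = summed SSE reduction."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    use = [0] * p
    importance = [0.0] * p
    for _tree in range(n_trees):
        idx = [rng.randrange(n) for _i in range(n)]  # bootstrap sample
        Xb = [X[i] for i in idx]
        yb = [y[i] for i in idx]
        candidates = rng.sample(range(p), mtry)  # random subset of variables
        j, _threshold, gain = best_split(Xb, yb, candidates)
        if j is not None:
            use[j] += 1
            importance[j] += gain
    return use, importance


# Toy data (an assumption for illustration): only variable 0 carries signal.
data_rng = random.Random(1)
X = [[data_rng.random() for _j in range(4)] for _i in range(60)]
y = [1.0 if row[0] > 0.5 else 0.0 for row in X]

use, importance = variable_statistics(X, y)
print("variable use:       ", use)
print("variable importance:", [round(v, 2) for v in importance])
```

On this toy model the informative variable is selected whenever it is among the candidates, so its use count and importance dominate those of the noise variables. Statistics that fail to separate in this way would be one possible sign that the candidate-selection scheme deserves a second look.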