Due to their long-standing reputation as excellent off-the-shelf predictors, random forests remain a go-to model of choice for applied statisticians and data scientists. Despite their widespread use, however, until recently little was known about their inner workings or about which aspects of the procedure drive their success. Two competing hypotheses have since emerged: one based on interpolation and the other on regularization. This work argues in favor of the latter by using the regularization framework to reexamine the decades-old question of whether individual trees in an ensemble ought to be pruned. Although default constructions of random forests in most popular software packages grow trees to near full depth, we provide strong evidence that tree depth should be seen as a natural form of regularization across the entire procedure. In particular, our work suggests that random forests with shallow trees are advantageous when the signal-to-noise ratio in the data is low. In building this argument, we also critique the newly popular notion of "double descent" in random forests, drawing parallels to U-statistics and arguing that the noticeable jumps in random forest accuracy are the result of simple averaging rather than interpolation.
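The central claim above, that limiting tree depth acts as regularization and helps most when the signal-to-noise ratio is low, can be illustrated with a minimal sketch. This is not the paper's experimental setup; the data-generating process and all parameter choices below are hypothetical, and the comparison uses scikit-learn's `RandomForestRegressor` with `max_depth` as the depth control.

```python
# A minimal sketch (hypothetical setup, not the paper's experiments):
# on low-SNR data, a random forest of shallow trees can out-predict
# one whose trees are grown to (near) full depth.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
# Weak linear signal (variance ~2) buried under heavy noise (variance 25),
# i.e. a very low signal-to-noise ratio.
y = X[:, 0] + X[:, 1] + rng.normal(scale=5.0, size=n)

X_test = rng.normal(size=(2000, p))
y_test = X_test[:, 0] + X_test[:, 1] + rng.normal(scale=5.0, size=2000)

mse_by_depth = {}
for depth in (2, None):  # shallow trees vs. scikit-learn's default full depth
    rf = RandomForestRegressor(n_estimators=500, max_depth=depth,
                               random_state=0).fit(X, y)
    mse_by_depth[depth] = mean_squared_error(y_test, rf.predict(X_test))
    print(f"max_depth={depth}: test MSE = {mse_by_depth[depth]:.2f}")
```

Under this low-SNR regime the depth-2 forest typically attains a lower test MSE than the full-depth forest; with a high-SNR target (e.g. dropping the noise scale toward zero) the ordering tends to reverse, which is the sense in which depth serves as a tuning parameter rather than something to be left at full depth by default.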