Random Forests (RFs) are among the state-of-the-art methods in machine learning and offer excellent performance with nearly zero parameter tuning. Remarkably, RFs seem to be impervious to overfitting even though their basic building blocks are well known to overfit. Recently, a widely received study argued that a RF exhibits a so-called double-descent curve: first, the model overfits the data, producing a U-shaped test-error curve, and then, once a certain model complexity is reached, its performance suddenly improves again. In this paper, we challenge the notion that model capacity is the correct tool to explain the success of RFs and argue that the algorithm which trains the model plays a more important role than previously thought. We show that a RF does not exhibit a double-descent curve but rather a single descent; hence, it does not overfit in the classic sense. We further present a RF variation that also does not overfit, although its decision boundary approximates that of an overfitted decision tree (DT). Similarly, we show that a DT which approximates the decision boundary of a RF will still overfit. Last, we study the diversity of an ensemble as a tool to estimate its performance. To do so, we introduce the Negative Correlation Forest (NCForest), which allows for precise control over the diversity in the ensemble. We show that the diversity and the bias indeed have a crucial impact on the performance of the RF: too little diversity collapses the performance of the RF to that of a single tree, whereas too much diversity means that most trees no longer produce correct outputs. However, in between these two extremes we find a large range of different trade-offs, all with roughly equal performance. Hence, the specific trade-off between bias and diversity does not matter as long as the algorithm reaches this good trade-off regime.
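The bias-diversity trade-off referenced above can be made precise. A minimal sketch, using the classic ambiguity decomposition for squared loss and the negative correlation penalty it motivates (the exact NCForest objective in the paper may differ), is the following. For an ensemble of $M$ members $f_1,\dots,f_M$ with average prediction $\bar f(x) = \frac{1}{M}\sum_{i=1}^{M} f_i(x)$,

\[
\bigl(\bar f(x) - y\bigr)^2
\;=\;
\underbrace{\frac{1}{M}\sum_{i=1}^{M}\bigl(f_i(x) - y\bigr)^2}_{\text{average individual error (bias)}}
\;-\;
\underbrace{\frac{1}{M}\sum_{i=1}^{M}\bigl(f_i(x) - \bar f(x)\bigr)^2}_{\text{diversity (ambiguity)}} .
\]

Classic negative correlation learning trades these terms against each other through a per-member loss

\[
\ell_i \;=\; \bigl(f_i(x) - y\bigr)^2 \;+\; \lambda\,\bigl(f_i(x) - \bar f(x)\bigr)\!\!\sum_{j \neq i}\bigl(f_j(x) - \bar f(x)\bigr)
\;=\; \bigl(f_i(x) - y\bigr)^2 \;-\; \lambda\,\bigl(f_i(x) - \bar f(x)\bigr)^2 ,
\]

so that $\lambda = 0$ recovers independently trained members, while larger $\lambda$ rewards spreading the members away from the ensemble mean, i.e. increases diversity at the cost of individual accuracy.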