Random Forests (RFs) are among the state-of-the-art methods in machine learning and offer excellent performance with nearly zero parameter tuning. Remarkably, RFs seem to be impervious to overfitting even though their basic building blocks are well known to overfit. Recently, a widely received study argued that a RF exhibits a so-called double-descent curve: first, the model overfits the data, producing a U-shaped test-error curve, and then, once a certain model complexity is reached, its performance suddenly improves again. In this paper, we challenge the notion that model capacity is the correct tool to explain the success of RFs and argue that the algorithm which trains the model plays a more important role than previously thought. We show that a RF does not exhibit a double-descent curve but rather a single descent; hence, it does not overfit in the classic sense. We further present a RF variation that also does not overfit, although its decision boundary approximates that of an overfitted decision tree (DT). Similarly, we show that a DT which approximates the decision boundary of a RF will still overfit. Last, we study the diversity of an ensemble as a tool to estimate its performance. To do so, we introduce the Negative Correlation Forest (NCForest), which allows for precise control over the diversity in the ensemble. We show that the diversity and the bias indeed have a crucial impact on the performance of the RF: too little diversity collapses the performance of the RF to that of a single tree, whereas too much diversity means that most trees no longer produce correct outputs. However, in between these two extremes we find a large range of different trade-offs, all with roughly equal performance. Hence, the specific trade-off between bias and diversity does not matter as long as the algorithm reaches this good trade-off regime.
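The bias-diversity trade-off referenced above can be made precise. A minimal sketch, using the classic ambiguity decomposition for squared loss and the negative correlation penalty it motivates (the exact NCForest objective in the paper may differ), is the following. For an ensemble of $M$ members $f_1,\dots,f_M$ with average prediction $\bar f(x) = \frac{1}{M}\sum_{i=1}^{M} f_i(x)$,

\[
\bigl(\bar f(x) - y\bigr)^2
\;=\;
\underbrace{\frac{1}{M}\sum_{i=1}^{M}\bigl(f_i(x) - y\bigr)^2}_{\text{average individual error (bias)}}
\;-\;
\underbrace{\frac{1}{M}\sum_{i=1}^{M}\bigl(f_i(x) - \bar f(x)\bigr)^2}_{\text{diversity (ambiguity)}} .
\]

Classic negative correlation learning trades these terms against each other through a per-member loss

\[
\ell_i \;=\; \bigl(f_i(x) - y\bigr)^2 \;+\; \lambda\,\bigl(f_i(x) - \bar f(x)\bigr)\!\!\sum_{j \neq i}\bigl(f_j(x) - \bar f(x)\bigr)
\;=\; \bigl(f_i(x) - y\bigr)^2 \;-\; \lambda\,\bigl(f_i(x) - \bar f(x)\bigr)^2 ,
\]

so that $\lambda = 0$ recovers independently trained members, while larger $\lambda$ rewards spreading the members away from the ensemble mean, i.e. increases diversity at the cost of individual accuracy.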