It is notoriously difficult to build a bad Random Forest (RF). Concurrently, RF blatantly overfits in-sample without any apparent consequence out-of-sample. Standard arguments, like the classic bias-variance trade-off or double descent, cannot rationalize this paradox. I propose a new explanation: bootstrap aggregation and model perturbation as implemented by RF automatically prune a latent "true" tree. More generally, randomized ensembles of greedily optimized learners implicitly perform optimal early stopping out-of-sample. So there is no need to tune the stopping point. By construction, novel variants of Boosting and MARS are also eligible for automatic tuning. I empirically demonstrate the property, with simulated and real data, by reporting that these new completely overfitting ensembles perform similarly to their tuned counterparts -- or better.
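The claim can be illustrated with a small numerical sketch (not taken from the paper; the data-generating process and hyperparameters below are assumptions for illustration): a fully grown, deliberately overfitting Random Forest is compared out-of-sample against a conventionally depth-limited one on simulated data, using scikit-learn's `RandomForestRegressor` and the Friedman #1 benchmark.

```python
# Minimal sketch (illustrative assumptions, not the paper's experiments):
# a fully grown, overfitting Random Forest vs a depth-restricted ("tuned") one.
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=2000, noise=1.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Fully grown trees: each tree fits its bootstrap sample (near) perfectly in-sample.
rf_full = RandomForestRegressor(n_estimators=500, max_depth=None,
                                min_samples_leaf=1, random_state=0).fit(X_tr, y_tr)
# A conventionally "tuned" forest with shallow trees (explicit capacity control).
rf_tuned = RandomForestRegressor(n_estimators=500, max_depth=5,
                                 random_state=0).fit(X_tr, y_tr)

for name, rf in [("fully grown", rf_full), ("depth-limited", rf_tuned)]:
    print(f"{name:13s}  in-sample R2 = {r2_score(y_tr, rf.predict(X_tr)):.3f}  "
          f"out-of-sample R2 = {r2_score(y_te, rf.predict(X_te)):.3f}")
```

On such simulated data the fully grown forest typically shows a near-perfect in-sample fit yet an out-of-sample score close to that of the depth-limited forest, which is the pattern of "overfitting in-sample without apparent consequence out-of-sample" that the abstract describes.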