Statistical wisdom suggests that very complex models that interpolate the training data will predict poorly on unseen examples. Yet this aphorism has recently been challenged by the identification of benign overfitting regimes, studied especially for parametric models: generalization capabilities may be preserved despite high model complexity. While it is widely known that fully grown decision trees interpolate and, in turn, have poor predictive performance, the same behavior is yet to be analyzed for random forests. In this paper, we study the trade-off between interpolation and consistency for several types of random forest algorithms. Theoretically, we prove that interpolation and consistency cannot be achieved simultaneously for non-adaptive random forests. Since adaptivity seems to be the cornerstone for bringing together interpolation and consistency, we introduce and study interpolating Adaptive Centered Forests, which are proved to be consistent in a noiseless scenario. Numerical experiments show that Breiman's random forests are consistent while exactly interpolating when no bootstrap step is involved. We theoretically control the size of the interpolation area, which converges to zero fast enough that exact interpolation and consistency occur in conjunction.
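To make the notion of exact interpolation concrete, here is a minimal sketch of a toy one-dimensional forest. It is not the paper's Adaptive Centered Forest and not Breiman's algorithm: the split rule (a random threshold placed between two neighbouring samples), the function names, the sample size, and the noise level are all assumptions chosen for illustration. It only shows that averaging fully grown trees built on the full sample reproduces every training label, whereas adding a bootstrap step generally breaks exact interpolation. It says nothing about consistency.

```python
import math
import random

def build_tree(points, rng):
    """Grow a 1-D tree on x-sorted (x, y) pairs until each leaf holds one sample."""
    if len(points) == 1:
        return ("leaf", points[0][1])
    i = rng.randrange(1, len(points))  # pick a random gap between neighbouring samples
    left_x, right_x = points[i - 1][0], points[i][0]
    # Hypothetical split rule: threshold strictly inside the chosen gap.
    threshold = left_x + (right_x - left_x) * rng.uniform(0.25, 0.75)
    return ("node", threshold,
            build_tree(points[:i], rng), build_tree(points[i:], rng))

def predict_tree(tree, x):
    """Follow the splits down to a leaf and return its label."""
    while tree[0] == "node":
        _, threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree[1]

def fit_forest(data, n_trees, bootstrap, seed=0):
    """Fit fully grown trees, each on the full sample or on a bootstrap resample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(data) for _ in data] if bootstrap else data
        points = sorted(set(sample))  # sort by x, drop bootstrap duplicates
        forest.append(build_tree(points, rng))
    return forest

def predict_forest(forest, x):
    """Average the tree predictions."""
    return sum(predict_tree(tree, x) for tree in forest) / len(forest)

def max_train_error(forest, data):
    """Largest absolute gap between forest prediction and training label."""
    return max(abs(predict_forest(forest, x) - y) for x, y in data)

# Noisy training sample (assumed setup: sine signal plus Gaussian noise).
data_rng = random.Random(42)
train = []
for _ in range(30):
    x = data_rng.random()
    train.append((x, math.sin(2 * math.pi * x) + data_rng.gauss(0, 0.3)))

err_no_bootstrap = max_train_error(fit_forest(train, 50, bootstrap=False), train)
err_bootstrap = max_train_error(fit_forest(train, 50, bootstrap=True), train)
print("max training error without bootstrap:", err_no_bootstrap)
print("max training error with bootstrap:   ", err_bootstrap)
```

Without resampling, every tree isolates each training point in its own leaf, so the averaged prediction at a training point is its label up to floating-point rounding. With a bootstrap step, some trees never see a given point and predict a neighbour's label there, so the forest no longer interpolates exactly.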