The bias-variance trade-off is a central concept in supervised learning. In classical statistics, increasing the complexity of a model (e.g., the number of parameters) reduces bias but also increases variance. Until recently, it was commonly believed that optimal performance is achieved at intermediate model complexities that strike a balance between bias and variance. Modern deep learning methods flout this dogma, achieving state-of-the-art performance using "over-parameterized" models, where the number of fit parameters is large enough to perfectly fit the training data. As a result, understanding bias and variance in over-parameterized models has emerged as a fundamental problem in machine learning. Here, we use methods from statistical physics to derive analytic expressions for bias and variance in three minimal models for over-parameterization (linear regression and two-layer neural networks with linear and nonlinear activation functions), allowing us to disentangle properties stemming from the model architecture and from the random sampling of data. All three models exhibit a phase transition to an interpolation regime where the training error is zero. At the interpolation transition for each model, the test error diverges due to diverging variance (while the bias remains finite). In contrast with classical intuition, we also show that over-parameterized models can overfit even in the absence of noise and exhibit bias even when the student and teacher models match. We synthesize these results to construct a holistic understanding of generalization error and the bias-variance trade-off in over-parameterized models and relate our results to random matrix theory.