Modern machine learning methods are often overparametrized, allowing them to adapt to the data at a fine level. This can seem puzzling: in the worst case, such models need not generalize at all. The puzzle has inspired a great deal of work on when overparametrization reduces test error, a phenomenon called "double descent". Recent work has aimed to understand in greater depth why overparametrization helps generalization. This led to the discovery that the variance is unimodal as a function of the level of parametrization, and to decompositions of the variance into components arising from label noise, initialization, and randomness in the training data, in order to understand the sources of error. In this work we develop a deeper understanding of this area. Specifically, we propose using the analysis of variance (ANOVA) to decompose the variance of the test error in a symmetric way, in order to study the generalization performance of certain two-layer linear and non-linear networks. The advantage of ANOVA is that it reveals the effects of initialization, label noise, and training data more clearly than prior approaches. Moreover, we study the monotonicity and unimodality of the variance components: while prior work studied the unimodality of the overall variance, we study the properties of each term in the variance decomposition. One key insight is that, in typical settings, the interaction between training samples and initialization can dominate the variance, surprisingly being larger than their marginal effects. We also characterize "phase transitions" where the variance changes from unimodal to monotone. On a technical level, we leverage advanced deterministic equivalence techniques for Haar random matrices that, to our knowledge, have not yet been used in this area. We verify our results in numerical simulations and on empirical data examples.
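As a point of reference, the symmetric ANOVA (Hoeffding/Sobol) decomposition underlying this approach can be sketched as follows; the notation here ($\varepsilon$ for label noise, $\theta$ for initialization, $S$ for the training sample, and $T$ for the test error) is illustrative and not taken from the paper itself. Treating the three sources of randomness as independent, the variance splits into main effects and interactions:

$$
\operatorname{Var}(T)
= V_{\varepsilon} + V_{\theta} + V_{S}
+ V_{\varepsilon\theta} + V_{\varepsilon S} + V_{\theta S}
+ V_{\varepsilon\theta S},
\qquad
V_{i} = \operatorname{Var}\!\big(\mathbb{E}[T \mid i]\big),
\quad
V_{\theta S} = \operatorname{Var}\!\big(\mathbb{E}[T \mid \theta, S]\big) - V_{\theta} - V_{S}.
$$

In this notation, the "interaction between training samples and initialization" discussed above corresponds to the term $V_{\theta S}$, which, in the settings described, can exceed the marginal terms $V_{\theta}$ and $V_{S}$.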