Machine learning is predicated on the concept of generalization: a model achieving low error on a sufficiently large training set should also perform well on novel samples from the same distribution. We show that both data whitening and second-order optimization can harm or entirely prevent generalization. In general, model training harnesses information contained in the sample-sample second moment matrix of a dataset. For a broad class of models, namely those with a fully connected first layer, we prove that the information contained in this matrix is the only information that can be used to generalize. Models trained on whitened data, or with certain second-order optimization schemes, have reduced access to this information, resulting in diminished or nonexistent generalization ability. We experimentally verify these predictions for several architectures, and further demonstrate that generalization continues to be harmed even when the theoretical requirements are relaxed. However, we also show experimentally that regularized second-order optimization can provide a practical tradeoff: training is accelerated, less information is lost, and generalization can in some circumstances even improve.
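To make the whitening claim concrete, the following is a minimal NumPy sketch of ZCA whitening, not the paper's code: the function name `zca_whiten`, the `eps` regularizer, and the toy data are illustrative assumptions. It shows how whitening flattens the spectrum of the sample-sample second moment matrix, the quantity the abstract identifies as the sole carrier of generalizable information for models with a fully connected first layer.

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten the rows of X (shape: n_samples x n_features)."""
    Xc = X - X.mean(axis=0, keepdims=True)          # center each feature
    cov = Xc.T @ Xc / Xc.shape[0]                   # feature-feature covariance
    eigvals, eigvecs = np.linalg.eigh(cov)          # symmetric eigendecomposition
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return Xc @ W                                   # feature covariance ~ identity

rng = np.random.default_rng(0)
X = rng.normal(size=(128, 16)) @ rng.normal(size=(16, 16))  # correlated toy data

K = X @ X.T                 # sample-sample second moment matrix (n x n)
Xw = zca_whiten(X)
Kw = Xw @ Xw.T              # same matrix after whitening

# Whitening flattens the nonzero spectrum of the sample-sample matrix:
# the eigenvalue structure that carried dataset information is erased.
print(np.round(np.linalg.eigvalsh(K)[-5:], 1))   # spread-out spectrum
print(np.round(np.linalg.eigvalsh(Kw)[-5:], 1))  # ~flat, all close to n_samples
```

Per the abstract, certain second-order optimization schemes reduce access to this same second-moment structure from the parameter-update side, which is why the two interventions are analyzed together.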