We investigate the spectral properties of linear-width feed-forward neural networks, where the sample size is asymptotically proportional to the network width. Empirically, we show that the weight spectra in this high-dimensional regime are invariant when the network is trained by gradient descent with small constant learning rates, and that the changes in both the operator and Frobenius norms are $\Theta(1)$ in the limit. This implies that the bulk spectra of both the conjugate kernel and the neural tangent kernel are also invariant. We demonstrate similar characteristics for models trained with mini-batch (stochastic) gradient descent with small learning rates, and provide a theoretical justification for this special scenario. When the learning rate is large, we show empirically that an outlier eigenvalue emerges, with its corresponding eigenvector aligned with the structure of the training data. We also show that after adaptive gradient training, which attains lower test error and where feature learning emerges, both the weight and kernel matrices exhibit heavy-tailed behavior. These different spectral properties, namely the invariant bulk, the spike, and the heavy-tailed distribution, correlate with how far the kernels deviate from initialization. To understand this phenomenon better, we focus on a toy model, a two-layer network trained on synthetic data, which exhibits different spectral properties under different training strategies. Analogous phenomena also appear when we train conventional neural networks on real-world data. Our results show that monitoring the evolution of the spectra during training is an important step toward understanding the training dynamics and feature learning.
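The following is a minimal sketch, not the paper's exact experimental setup, of the kind of spectral monitoring described above: a two-layer ReLU network on synthetic Gaussian data in the linear-width regime ($n \propto$ width), trained by full-batch gradient descent with a small constant learning rate, after which the singular-value spectrum and norm change of the hidden-layer weights are compared to initialization. All dimensions, the learning rate, and the step count are illustrative choices.

```python
# Sketch: spectral monitoring of a two-layer ReLU network (assumed setup).
import numpy as np

rng = np.random.default_rng(0)
d, width, n = 256, 256, 512          # input dim, hidden width, sample size (n ~ width)
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = rng.standard_normal(n)

W = rng.standard_normal((width, d)) / np.sqrt(d)   # first-layer weights
a = rng.standard_normal(width) / np.sqrt(width)    # second-layer weights
W0 = W.copy()

lr, steps = 0.1, 200                 # small constant learning rate, illustrative
for _ in range(steps):
    H = np.maximum(X @ W.T, 0.0)     # hidden activations, shape (n, width)
    err = H @ a - y                  # squared-loss residual
    grad_a = H.T @ err / n
    grad_H = np.outer(err, a) * (H > 0)            # backprop through ReLU
    grad_W = grad_H.T @ X / n
    a -= lr * grad_a
    W -= lr * grad_W

# Compare spectra: for small lr the bulk of singular values barely moves,
# and the Frobenius norm of the update stays O(1) relative to ||W0||_F.
s0 = np.linalg.svd(W0, compute_uv=False)
s1 = np.linalg.svd(W, compute_uv=False)
print("top singular value: init %.3f -> trained %.3f" % (s0[0], s1[0]))
print("||W - W0||_F = %.3f vs ||W0||_F = %.3f"
      % (np.linalg.norm(W - W0), np.linalg.norm(W0)))
```

Under these assumptions, repeating the run with a much larger learning rate or an adaptive optimizer is the natural way to look for the spike and heavy-tailed behavior discussed above; the sketch only illustrates the small-learning-rate case.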