Characterization of local minima draws much attention in theoretical studies of deep learning. In this study, we investigate the distribution of parameters in an over-parametrized finite neural network trained by ridge-regularized empirical square-risk minimization (RERM). We develop a new theory of the ridgelet transform, a wavelet-like integral transform that provides a powerful and general framework for the theoretical study of neural networks with not only ReLU but also general activation functions. We show that the distribution of the parameters converges to a spectrum of the ridgelet transform. This result provides new insight into the characterization of the local minima of neural networks, and a theoretical background for inductive-bias theories based on lazy regimes. Through numerical experiments with finite models, we confirm the visual resemblance between the parameter distribution trained by SGD and the ridgelet spectrum computed by numerical integration.
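As a rough illustration of the kind of computation the experiments refer to, the sketch below estimates a ridgelet spectrum by numerical integration, assuming the common convention $R[f](a,b) = \int f(x)\,\psi(ax - b)\,dx$ for a one-dimensional input. The function name `ridgelet_spectrum_1d`, the Gaussian-derivative choice of the ridgelet function $\psi$, and the toy target are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def psi(z):
    # An admissible ridgelet function (illustrative choice): the second
    # derivative of the Gaussian, psi(z) = (z**2 - 1) * exp(-z**2 / 2).
    return (z**2 - 1.0) * np.exp(-z**2 / 2.0)

def ridgelet_spectrum_1d(f, a_grid, b_grid, x_min=-10.0, x_max=10.0, n_x=2001):
    """Estimate R[f](a, b) = \\int f(x) psi(a*x - b) dx on a grid of (a, b)
    with the trapezoidal rule, for a one-dimensional input x."""
    x = np.linspace(x_min, x_max, n_x)
    fx = f(x)
    spectrum = np.empty((len(a_grid), len(b_grid)))
    for i, a in enumerate(a_grid):
        for j, b in enumerate(b_grid):
            # Trapezoidal approximation of the integral over x.
            spectrum[i, j] = np.trapz(fx * psi(a * x - b), x)
    return spectrum

if __name__ == "__main__":
    target = lambda x: np.exp(-x**2) * np.sin(4.0 * x)  # toy target function
    a_grid = np.linspace(-6.0, 6.0, 61)
    b_grid = np.linspace(-6.0, 6.0, 61)
    S = ridgelet_spectrum_1d(target, a_grid, b_grid)
    print(S.shape)  # (61, 61) grid of spectrum values over (a, b)
```

In the paper's setting, a heatmap of such a spectrum over the $(a, b)$ plane would be the object compared visually against the scatter of hidden-layer parameters after SGD training.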