Classical statistical learning theory says that fitting too many parameters leads to overfitting and poor performance. The fact that modern deep neural networks generalize well despite having a large number of parameters contradicts this finding and remains a major unsolved problem in explaining the success of deep learning. The implicit regularization induced by stochastic gradient descent (SGD) is regarded as important, but its precise mechanism is still unknown. In this work, we study how the local geometry of the energy landscape around local minima affects the statistical properties of SGD with Gaussian gradient noise. We argue that, under reasonable assumptions, the local geometry forces SGD to stay close to a low-dimensional subspace, and that this induces implicit regularization and results in tighter bounds on the generalization error of deep neural networks. To derive generalization error bounds for neural networks, we first introduce a notion of stagnation sets around the local minima and impose a local essential convexity property on the population risk. Under these conditions, we derive lower bounds for SGD to remain in these stagnation sets. If stagnation occurs, we derive a bound on the generalization error of deep neural networks that involves the spectral norms of the weight matrices but not the number of network parameters. Technically, our proofs are based on controlling the change of parameter values in the SGD iterates and on local uniform convergence of the empirical loss functions, based on the entropy of suitable neighborhoods around local minima. Our work attempts to better connect non-convex optimization and generalization analysis with uniform convergence.
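As a rough illustration of the claim that local geometry can keep SGD close to a low-dimensional subspace, the following toy sketch runs SGD with additive Gaussian gradient noise on a quadratic loss that is flat in two directions and steep in three. Everything here is an assumption made for illustration only: the quadratic loss, the curvature values, the step size, and the helper name `noisy_sgd` do not come from the paper, and the sketch does not reproduce its stagnation sets or its bounds.

```python
import numpy as np


def noisy_sgd(curvatures, lr=0.05, noise_std=1.0, steps=2000, seed=0):
    """SGD with Gaussian gradient noise on L(x) = 0.5 * sum(h_i * x_i**2).

    Hypothetical toy setup: `curvatures` holds the Hessian eigenvalues h_i.
    Returns an array of shape (steps, dim) with the iterate after each step.
    """
    rng = np.random.default_rng(seed)
    h = np.asarray(curvatures, dtype=float)
    x = np.zeros_like(h)  # start exactly at the minimum
    path = np.empty((steps, h.size))
    for k in range(steps):
        grad = h * x  # exact gradient of the quadratic loss
        noise = rng.normal(0.0, noise_std, size=h.size)  # Gaussian gradient noise
        x = x - lr * (grad + noise)
        path[k] = x
    return path


# Two flat directions (zero curvature) and three steep ones (assumed values).
curvatures = [0.0, 0.0, 10.0, 10.0, 10.0]
path = noisy_sgd(curvatures)

# Root-mean-square distance from the minimum along each coordinate.
rms = np.sqrt((path ** 2).mean(axis=0))
print("RMS displacement per coordinate:", rms)
```

In this toy setting the iterates drift freely along the flat coordinates but stay tightly confined along the steep ones, so the trajectory remains near the two-dimensional flat subspace. This is only meant to give intuition for the kind of behaviour the abstract describes.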