In this paper, we study the sharpness of the deep learning (DL) loss landscape around local minima in order to reveal systematic mechanisms underlying the generalization abilities of DL models. Our analysis is performed across varying network and optimizer hyper-parameters, and involves a rich family of different sharpness measures. We compare these measures and show that the low-pass filter-based measure exhibits the highest correlation with the generalization abilities of DL models, has high robustness to both data and label noise, and furthermore can track the double descent behavior of neural networks. We next derive an optimization algorithm that relies on a low-pass filter (LPF) and actively searches for flat regions in the DL optimization landscape using an SGD-like procedure. The update of the proposed algorithm, which we call LPF-SGD, is determined by the gradient of the convolution of the filter kernel with the loss function and can be efficiently computed using Monte Carlo (MC) sampling. We empirically show that our algorithm achieves superior generalization performance compared to common DL training strategies. On the theoretical front, we prove that LPF-SGD converges to an optimal point with a smaller generalization error than SGD.
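To illustrate the core idea behind the LPF-SGD update, the following is a minimal sketch (not the authors' implementation) of estimating the gradient of the loss convolved with a Gaussian kernel via MC sampling. The kernel width `sigma`, the number of MC samples `n_mc`, and the toy quadratic loss are illustrative assumptions.

```python
import numpy as np

def lpf_grad(loss_grad, theta, sigma=0.1, n_mc=8, rng=None):
    """MC estimate of the gradient of the Gaussian-smoothed loss.

    The smoothed loss is (L * K)(theta) = E_{tau ~ N(0, sigma^2 I)}[L(theta + tau)],
    so its gradient is the average gradient at Gaussian-perturbed parameters.
    """
    rng = rng or np.random.default_rng(0)
    grads = [loss_grad(theta + sigma * rng.standard_normal(theta.shape))
             for _ in range(n_mc)]
    return np.mean(grads, axis=0)

# Toy example: quadratic loss L(theta) = 0.5 * ||A theta||^2 with gradient A^T A theta.
A = np.diag([10.0, 0.1])
loss_grad = lambda th: A.T @ A @ th

theta = np.array([1.0, 1.0])
lr = 0.01
for _ in range(100):
    theta -= lr * lpf_grad(loss_grad, theta)  # SGD-like step on the smoothed loss
print(theta)
```

In practice the perturbed gradients would be computed on mini-batches with automatic differentiation; the sketch only shows how the smoothed-loss gradient reduces to averaging gradients at randomly perturbed parameter vectors.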