Empirical studies of the loss landscape of deep networks have revealed that many local minima are connected through low-loss valleys. Ensemble models that sample different parts of a low-loss valley have reached state-of-the-art (SOTA) performance. Yet little is known about the theoretical origin of such valleys. We present a general framework for finding continuous symmetries in the parameter space, which carve out low-loss valleys. Importantly, we introduce a novel set of nonlinear, data-dependent symmetries for neural networks. These symmetries can transform a trained model such that it performs similarly on new samples. We then show that conserved quantities associated with linear symmetries can be used to define coordinates along low-loss valleys. The conserved quantities reveal that, with common initialization methods, gradient flow explores only a small part of the global minimum. By relating conserved quantities to the convergence rate and the sharpness of the minimum, we provide insights into how initialization impacts convergence and generalizability. We also find the nonlinear symmetry action to be viable for building ensembles that improve robustness under certain adversarial attacks.
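To make the core idea concrete, the sketch below illustrates the kind of linear parameter-space symmetry and associated conserved quantity the abstract refers to, using a standard textbook example rather than the paper's general framework: a two-layer linear network whose loss is invariant under an invertible rescaling of the hidden layer, with the well-known conserved quantity Q = W1 W1^T - W2^T W2 of gradient flow. The network sizes, learning rate, and variable names are illustrative assumptions; the paper's nonlinear, data-dependent symmetries are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-layer linear network: f(x) = W2 @ W1 @ x (assumed toy sizes).
d_in, d_hid, d_out, n = 5, 4, 3, 20
X = rng.normal(size=(d_in, n))
Y = rng.normal(size=(d_out, n))
W1 = 0.3 * rng.normal(size=(d_hid, d_in))
W2 = 0.3 * rng.normal(size=(d_out, d_hid))

def loss(W1, W2):
    R = W2 @ W1 @ X - Y
    return 0.5 * np.sum(R ** 2)

# 1) Continuous symmetry: for any invertible G acting on the hidden layer,
#    (W1, W2) -> (G @ W1, W2 @ inv(G)) leaves the loss unchanged, so every
#    trained model sits inside a whole valley of equal-loss parameters.
G = np.eye(d_hid) + 0.1 * rng.normal(size=(d_hid, d_hid))
print(loss(W1, W2), loss(G @ W1, W2 @ np.linalg.inv(G)))  # equal up to round-off

# 2) Conserved quantity associated with this symmetry:
#    Q = W1 @ W1.T - W2.T @ W2 stays (nearly) constant along small-step
#    gradient descent, which approximates gradient flow.
def Q(W1, W2):
    return W1 @ W1.T - W2.T @ W2

lr = 1e-3
Q0 = Q(W1, W2)
for _ in range(2000):
    E = W2 @ W1 @ X - Y        # residual
    grad_prod = E @ X.T        # gradient of the loss w.r.t. the product W2 @ W1
    gW1 = W2.T @ grad_prod     # chain rule: gradient w.r.t. W1
    gW2 = grad_prod @ W1.T     # chain rule: gradient w.r.t. W2
    W1 -= lr * gW1
    W2 -= lr * gW2

drift = np.max(np.abs(Q(W1, W2) - Q0))
print(f"loss after GD: {loss(W1, W2):.4f}, max |Q - Q0| drift: {drift:.2e}")
```

The value of Q is fixed at initialization and then (approximately) preserved by training, which is why such conserved quantities can serve as coordinates labeling where along a low-loss valley gradient flow ends up, and why the initialization scheme constrains which part of the valley is reachable.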