Empirical studies of the loss landscape of deep networks have revealed that many local minima are connected through low-loss valleys. Yet, little is known about the theoretical origin of such valleys. We present a general framework for finding continuous symmetries in the parameter space, which carve out low-loss valleys. Our framework uses equivariances of the activation functions and can be applied to different layer architectures. To generalize this framework to nonlinear neural networks, we introduce a novel set of nonlinear, data-dependent symmetries. These symmetries can transform a trained model such that it performs similarly on new samples, which enables ensemble building that improves robustness under certain adversarial attacks. We then show that conserved quantities associated with linear symmetries can be used to define coordinates along low-loss valleys. These conserved quantities help reveal that, with common initialization methods, gradient flow explores only a small part of the global minimum. By relating conserved quantities to the convergence rate and the sharpness of the minimum, we provide insights into how initialization affects convergence and generalizability.
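To make the notion of a continuous parameter-space symmetry concrete, the following is a minimal sketch (not taken from the paper; all names and shapes are illustrative) of the familiar rescaling symmetry of a two-layer ReLU network, one simple instance of the kind of symmetry the framework covers: scaling a hidden unit's incoming weights by g > 0 and its outgoing weights by 1/g leaves the network output, and hence the loss, unchanged, so the orbit of this transformation lies inside a low-loss valley.

```python
# Minimal sketch (assumed example, not the paper's framework): the rescaling
# symmetry of a two-layer ReLU network and a quantity conserved by gradient flow.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))          # batch of 5 inputs, 4 features
W1 = rng.normal(size=(3, 4))         # first-layer weights (3 hidden units)
W2 = rng.normal(size=(2, 3))         # second-layer weights (2 outputs)

def forward(W1, W2, x):
    # ReLU hidden layer followed by a linear output layer
    return np.maximum(W1 @ x.T, 0.0).T @ W2.T

# Continuous symmetry: scale hidden unit i by g_i > 0 in W1 and by 1/g_i in W2.
# Because ReLU is positively homogeneous, the output is unchanged, so the whole
# one-parameter family of transformed weights sits in the same loss level set.
g = np.exp(rng.normal(size=3))       # positive per-unit scales
W1_t = np.diag(g) @ W1
W2_t = W2 @ np.diag(1.0 / g)

print(np.allclose(forward(W1, W2, x), forward(W1_t, W2_t, x)))  # True

# The per-unit quantity below is known to be conserved under gradient flow for
# ReLU networks, yet it changes along the rescaling orbit, so it can label
# positions along the low-loss valley (shown here only for illustration).
q = np.sum(W1**2, axis=1) - np.sum(W2**2, axis=0)
q_t = np.sum(W1_t**2, axis=1) - np.sum(W2_t**2, axis=0)
print(q, q_t)  # different values: the symmetry moves the weights along the valley
```

This toy case only illustrates the general idea; the framework described in the abstract constructs such symmetries, including nonlinear, data-dependent ones, for a much broader class of architectures and activation functions.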