The success of deep learning is due in large part to our ability to solve certain massive non-convex optimization problems with relative ease. Though non-convex optimization is NP-hard, simple algorithms -- often variants of stochastic gradient descent -- exhibit surprising effectiveness in fitting large neural networks in practice. We argue that neural network loss landscapes contain (nearly) a single basin after accounting for all possible permutation symmetries of hidden units a la Entezari et al. (2021). We introduce three algorithms to permute the units of one model to bring them into alignment with a reference model in order to merge the two models in weight space. This transformation produces a functionally equivalent set of weights that lie in an approximately convex basin near the reference model. Experimentally, we demonstrate the single basin phenomenon across a variety of model architectures and datasets, including the first (to our knowledge) demonstration of zero-barrier linear mode connectivity between independently trained ResNet models on CIFAR-10 and CIFAR-100. Additionally, we investigate intriguing phenomena relating model width and training time to mode connectivity. Finally, we discuss shortcomings of the linear mode connectivity hypothesis, including a counterexample to the single basin theory.
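To make the permutation-symmetry and weight-merging ideas concrete, here is a minimal sketch (not the paper's three alignment algorithms): it checks that permuting the hidden units of a small two-layer MLP leaves its function unchanged, and defines a simple weight-space interpolation of the kind used to test for linear mode connectivity. The helper names `mlp` and `interpolate` are illustrative assumptions, not APIs from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, W1, b1, W2, b2):
    # Two-layer MLP with a ReLU hidden layer.
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

d_in, d_hidden, d_out = 4, 8, 3
W1 = rng.normal(size=(d_in, d_hidden)); b1 = rng.normal(size=d_hidden)
W2 = rng.normal(size=(d_hidden, d_out)); b2 = rng.normal(size=d_out)

# Permute the hidden units: permute the columns of W1 (and entries of b1),
# and permute the rows of W2 to match. The network computes the same function.
perm = rng.permutation(d_hidden)
W1p, b1p, W2p = W1[:, perm], b1[perm], W2[perm, :]

x = rng.normal(size=(5, d_in))
assert np.allclose(mlp(x, W1, b1, W2, b2), mlp(x, W1p, b1p, W2p, b2))

# After aligning one model's units to a reference model, the two can be merged
# by linear interpolation in weight space; zero-barrier linear mode connectivity
# means the loss along this path never rises above the losses at the endpoints.
def interpolate(theta_a, theta_b, lam):
    return [(1 - lam) * a + lam * b for a, b in zip(theta_a, theta_b)]
```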