Neural networks trained with stochastic gradient descent (SGD) starting from different random initialisations typically find functionally very similar solutions, raising the question of whether there are meaningful differences between different SGD solutions. Entezari et al.\ recently conjectured that despite different initialisations, the solutions found by SGD lie in the same loss valley after taking into account the permutation invariance of neural networks. Concretely, they hypothesise that any two solutions found by SGD can be permuted such that the linear interpolation between their parameters forms a path without significant increases in loss. Here, we use a simple but powerful algorithm to find such permutations, which allows us to obtain direct empirical evidence that the hypothesis holds in fully connected networks. Strikingly, we find that two networks already live in the same loss valley at the time of initialisation, and that averaging their random, but suitably permuted, initialisations performs significantly above chance. In contrast, for convolutional architectures, our evidence suggests that the hypothesis does not hold. Especially in a large-learning-rate regime, SGD seems to discover diverse modes.
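To make the interpolation claim concrete, the sketch below (not the authors' implementation) evaluates the loss along the linear path between one network's parameters and a hidden-unit-permuted copy of another's, for a one-hidden-layer MLP in numpy. The toy data, network sizes, and the random permutation (standing in for the alignment algorithm, which the abstract does not specify) are illustrative assumptions.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))          # toy inputs (assumption)
y = rng.integers(0, 2, size=256)        # toy binary labels (assumption)
d_in, d_hidden = 10, 32

def init_params():
    return {
        "W1": rng.normal(scale=0.1, size=(d_hidden, d_in)),
        "b1": np.zeros(d_hidden),
        "w2": rng.normal(scale=0.1, size=d_hidden),
        "b2": 0.0,
    }

def loss(p):
    # Binary cross-entropy of a ReLU MLP with a sigmoid output.
    h = np.maximum(p["W1"] @ X.T + p["b1"][:, None], 0.0)
    logits = p["w2"] @ h + p["b2"]
    probs = 1.0 / (1.0 + np.exp(-logits))
    return -np.mean(y * np.log(probs + 1e-9) + (1 - y) * np.log(1 - probs + 1e-9))

def permute_hidden(p, perm):
    # Re-order hidden units: rows of W1, entries of b1, and the matching inputs of w2.
    return {"W1": p["W1"][perm], "b1": p["b1"][perm], "w2": p["w2"][perm], "b2": p["b2"]}

def interpolate(pa, pb, alpha):
    return {k: (1 - alpha) * pa[k] + alpha * pb[k] for k in pa}

theta_a, theta_b = init_params(), init_params()
perm = rng.permutation(d_hidden)  # placeholder; the paper uses an alignment algorithm
theta_b_perm = permute_hidden(theta_b, perm)

# Loss along the linear path; a flat profile (no barrier) indicates the two
# networks lie in the same loss valley after permutation.
for alpha in np.linspace(0.0, 1.0, 11):
    print(f"alpha={alpha:.1f}  loss={loss(interpolate(theta_a, theta_b_perm, alpha)):.4f}")
\end{verbatim}

In the setting of the abstract, the same path-evaluation would be run with trained (or freshly initialised) networks and a permutation chosen to align their hidden units, rather than the random one used here for illustration.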