The question of how and why the phenomenon of mode connectivity occurs in the training of deep neural networks has attracted considerable attention in the research community. From a theoretical perspective, two possible explanations have been proposed: (i) the loss function has connected sublevel sets, and (ii) the solutions found by stochastic gradient descent are dropout stable. While these explanations provide insights into the phenomenon, their assumptions are not always satisfied in practice. In particular, the first approach requires the network to have a layer with on the order of $N$ neurons ($N$ being the number of training samples), while the second requires the loss to remain almost invariant after removing half of the neurons at each layer (up to a rescaling of the remaining ones). In this work, we improve both conditions by exploiting the quality of the features at every intermediate layer together with a milder over-parameterization requirement. More specifically, we show that: (i) under generic assumptions on the features of the intermediate layers, it suffices that the last two hidden layers have on the order of $\sqrt{N}$ neurons, and (ii) if subsets of the features at each layer are linearly separable, then no over-parameterization is needed to show connectivity. Our experiments confirm that the proposed condition ensures the connectivity of solutions found by stochastic gradient descent, even in settings where the previous requirements do not hold.
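For concreteness, one standard way of formalizing the two notions referenced above is sketched below, following the formalization commonly used in the mode-connectivity literature; the exact constants, norms, and dropout fraction used in this work may differ, and the symbols $\mathcal{L}$ (training loss), $\theta_A, \theta_B$ (two solutions), $\Theta$ (parameter space), and $\epsilon$ (tolerance) are introduced here purely for illustration.

% Hedged sketch of the standard definitions; not quoted from the paper itself.
% Symbols $\mathcal{L}$, $\theta_A$, $\theta_B$, $\Theta$, $\epsilon$ are illustrative.
Two solutions $\theta_A, \theta_B \in \Theta$ are said to be $\epsilon$-connected if there exists a continuous path $\pi \colon [0,1] \to \Theta$ with $\pi(0) = \theta_A$, $\pi(1) = \theta_B$, and
\[
  \mathcal{L}\bigl(\pi(t)\bigr) \;\le\; \max\bigl\{\mathcal{L}(\theta_A),\, \mathcal{L}(\theta_B)\bigr\} + \epsilon
  \qquad \text{for all } t \in [0,1].
\]
A solution $\theta$ is $\epsilon$-dropout stable if, after zeroing out half of the neurons in each hidden layer and suitably rescaling the remaining ones, the resulting parameter vector $\theta'$ satisfies
\[
  \mathcal{L}(\theta') \;\le\; \mathcal{L}(\theta) + \epsilon .
\]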