A variety of recent works, spanning pruning, lottery tickets, and training within random subspaces, have shown that deep neural networks can be trained using far fewer degrees of freedom than the total number of parameters. We analyze this phenomenon for random subspaces by first examining the probability of successfully hitting a training loss sub-level set when training within a random subspace of a given training dimensionality. We find a sharp phase transition in the success probability from $0$ to $1$ as the training dimension surpasses a threshold. This threshold training dimension increases as the desired final loss decreases, but decreases as the initial loss decreases. We then theoretically explain the origin of this phase transition, and its dependence on initialization and final desired loss, in terms of properties of the high-dimensional geometry of the loss landscape. In particular, we show, via Gordon's escape theorem, that the training dimension plus the Gaussian width of the desired loss sub-level set, projected onto a unit sphere surrounding the initialization, must exceed the total number of parameters for the success probability to be large. In several architectures and datasets, we measure the threshold training dimension as a function of initialization and demonstrate that it is a small fraction of the total number of parameters, implying by our theory that successful training with so few dimensions is possible precisely because the Gaussian width of low-loss sub-level sets is very large. Moreover, we compare this threshold training dimension to more sophisticated ways of reducing training degrees of freedom, including lottery tickets as well as a new, analogous method: lottery subspaces. Code is available at https://github.com/ganguli-lab/degrees-of-freedom.
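The random-subspace training the abstract refers to can be sketched as follows: fix a random initialization $\theta_0$ and a random projection matrix $P$, and optimize only the $d$ subspace coordinates $c$ of $\theta = \theta_0 + P c$. The toy least-squares problem, dimensions, and learning rate below are illustrative assumptions for a minimal sketch, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: least-squares loss over D total parameters.
D, d = 200, 20                                # total parameters, training dimension
A = rng.standard_normal((50, D)) / np.sqrt(D)
b = rng.standard_normal(50)

def loss(theta):
    r = A @ theta - b
    return 0.5 * r @ r

theta0 = rng.standard_normal(D)               # random initialization theta_0
P = rng.standard_normal((D, d)) / np.sqrt(D)  # random subspace basis P

# Optimize only the d subspace coordinates c; full parameters are theta0 + P @ c.
c = np.zeros(d)
lr = 0.2
for _ in range(1000):
    r = A @ (theta0 + P @ c) - b
    grad_c = P.T @ (A.T @ r)                  # chain rule through the fixed projection
    c -= lr * grad_c

print(f"initial loss: {loss(theta0):.3f}, final loss: {loss(theta0 + P @ c):.3f}")
```

Whether a given target sub-level set is reachable this way is exactly the question the abstract's phase transition addresses: for small $d$ the random subspace typically misses the low-loss set, while past the threshold it almost always intersects it.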