There is mounting empirical evidence of emergent phenomena in the capabilities of deep learning methods as we scale up datasets, model sizes, and training times. While there are some accounts of how these resources modulate statistical capacity, far less is known about their effect on the computational problem of model training. This work conducts such an exploration through the lens of learning $k$-sparse parities of $n$ bits, a canonical family of problems which pose theoretical computational barriers. In this setting, we find that neural networks exhibit surprising phase transitions when scaling up dataset size and running time. In particular, we demonstrate empirically that with standard training, a variety of architectures learn sparse parities with $n^{O(k)}$ examples, with loss (and error) curves abruptly dropping after $n^{O(k)}$ iterations. These positive results nearly match known SQ lower bounds, even without an explicit sparsity-promoting prior. We elucidate the mechanisms of these phenomena with a theoretical analysis: we find that the phase transition in performance is not due to SGD "stumbling in the dark" until it finds the hidden set of features (a natural algorithm which also runs in $n^{O(k)}$ time); instead, we show that SGD gradually amplifies a Fourier gap in the population gradient.
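To make the experimental setting concrete, the following is a minimal sketch (not the authors' code) of the $(n, k)$-sparse parity task trained with plain SGD: labels are the product of a hidden size-$k$ subset of the $\pm 1$ input coordinates, and a small one-hidden-layer MLP is trained on fresh samples. The architecture, width, learning rate, batch size, and loss are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of learning a k-sparse parity of n bits with SGD.
# Hyperparameters below are illustrative assumptions, not the paper's settings.
import torch
import torch.nn as nn

n, k = 30, 3                       # number of input bits and sparsity (assumed values)
S = torch.randperm(n)[:k]          # hidden support of the parity
width, steps, lr = 100, 20_000, 0.1

def sample_batch(batch_size=64):
    # Uniform +/-1 inputs; label is the parity (product) of the k hidden coordinates.
    x = torch.randint(0, 2, (batch_size, n)).float() * 2 - 1
    y = x[:, S].prod(dim=1)
    return x, y

model = nn.Sequential(nn.Linear(n, width), nn.ReLU(), nn.Linear(width, 1))
opt = torch.optim.SGD(model.parameters(), lr=lr)

for t in range(steps):
    x, y = sample_batch()
    loss = (1 - y * model(x).squeeze(-1)).clamp(min=0).mean()  # hinge loss (assumed)
    opt.zero_grad()
    loss.backward()
    opt.step()
    if t % 1000 == 0:
        with torch.no_grad():
            xe, ye = sample_batch(4096)
            err = (model(xe).squeeze(-1).sign() != ye).float().mean()
        print(f"step {t:6d}  hinge loss {loss.item():.3f}  test error {err.item():.3f}")
```

Tracking the test error printed above over training is what exposes the abrupt "phase transition": the error stays near chance for many iterations and then drops sharply once the hidden support is effectively identified.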