Modern neural networks are often quite wide, which incurs large memory and computation costs. It is thus of great interest to train a narrower network. However, training narrow neural nets remains a challenging task. We ask two theoretical questions: Can narrow networks have expressivity as strong as that of wide ones? If so, does the loss function exhibit a benign optimization landscape? In this work, we provide partially affirmative answers to both questions for 1-hidden-layer networks with fewer than $n$ (sample size) neurons when the activation is smooth. First, we prove that as long as the width $m \geq 2n/d$ (where $d$ is the input dimension), the network's expressivity is strong, i.e., there exists at least one global minimizer with zero training loss. Second, we identify a nice local region with no local minima or saddle points. Nevertheless, it is not clear whether gradient descent can stay in this nice region. Third, we consider a constrained optimization formulation where the feasible region is this nice local region, and prove that every KKT point is a nearly global minimizer. Projected gradient methods are expected to converge to KKT points under mild technical conditions, but we leave the rigorous convergence analysis to future work. Thorough numerical results show that projected gradient methods on this constrained formulation significantly outperform SGD for training narrow neural nets.
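As a rough illustration of the kind of procedure described above, the following is a minimal NumPy sketch of projected gradient descent on a 1-hidden-layer tanh network of width $m = 2n/d$. The Euclidean ball around the initialization, its radius, the learning rate, and the synthetic data are all hypothetical placeholders; they stand in for the paper's constrained formulation but are not its actual construction of the "nice local region".

```python
# Illustrative sketch only: the feasible set is a simple Euclidean ball around
# the initialization, used as a stand-in for the paper's local region.
import numpy as np

rng = np.random.default_rng(0)

n, d = 64, 16            # sample size and input dimension (synthetic)
m = max(2 * n // d, 1)   # width m >= 2n/d, matching the expressivity condition

X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# One-hidden-layer network with a smooth activation (tanh here).
W = rng.standard_normal((m, d)) / np.sqrt(d)   # hidden-layer weights
a = rng.standard_normal(m) / np.sqrt(m)        # output-layer weights
W0, a0 = W.copy(), a.copy()                    # center of the feasible region
radius = 5.0                                   # hypothetical region radius

def forward(W, a, X):
    return np.tanh(X @ W.T) @ a                # predictions, shape (n,)

def loss(W, a):
    r = forward(W, a, X) - y
    return 0.5 * np.mean(r ** 2)

def project(W, a):
    """Project (W, a) onto the ball of the given radius around (W0, a0)."""
    dW, da = W - W0, a - a0
    norm = np.sqrt(np.sum(dW ** 2) + np.sum(da ** 2))
    if norm > radius:
        scale = radius / norm
        return W0 + scale * dW, a0 + scale * da
    return W, a

lr = 0.1
for step in range(2000):
    # Gradients of the mean squared error, computed by the chain rule.
    h = np.tanh(X @ W.T)                 # (n, m) hidden activations
    r = h @ a - y                        # (n,) residuals
    grad_a = h.T @ r / n
    grad_W = ((np.outer(r, a) * (1 - h ** 2)).T @ X) / n
    W, a = W - lr * grad_W, a - lr * grad_a
    W, a = project(W, a)                 # projected gradient step

print("final training loss:", loss(W, a))
```

The projection step is what distinguishes this from plain (S)GD: each iterate is pulled back into the feasible set, so the iterates cannot leave the region where, per the paper's analysis, KKT points are nearly global minimizers.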