Even for the gradient descent (GD) method applied to neural network training, understanding its optimization dynamics, including the convergence rate, iterate trajectories, function-value oscillations, and especially its implicit acceleration, remains a challenging problem. We analyze nonlinear models with the logistic loss and show that the steps of GD reduce to those of generalized perceptron algorithms (Rosenblatt, 1958), providing a new perspective on the dynamics. This reduction yields significantly simpler algorithmic steps, which we analyze using classical linear-algebra tools. With these tools, we demonstrate on a minimal example that the nonlinearity in a two-layer model can provably yield a faster iteration complexity of $\tilde{O}(\sqrt{d})$, compared to the $\Omega(d)$ achieved by linear models, where $d$ is the number of features. This helps explain the optimization dynamics and the implicit acceleration phenomenon observed in neural networks. The theoretical results are supported by extensive numerical experiments. We believe that this alternative view will further advance research on the optimization of neural networks.
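For intuition on the flavor of this reduction, here is a minimal sketch in the purely linear case (an illustrative assumption; the generalized perceptron steps analyzed in the paper concern the nonlinear two-layer model). For data $(x_i, y_i)$ with labels $y_i \in \{\pm 1\}$, logistic loss $\ell(z) = \log(1 + e^{-z})$, and margins $z_i = y_i\, w_t^\top x_i$, a GD step on the summed loss reads
\[
w_{t+1} \;=\; w_t \;+\; \eta \sum_{i} \sigma(-z_i)\, y_i x_i, \qquad \sigma(z) = \frac{1}{1 + e^{-z}},
\]
since $\ell'(z) = -\sigma(-z)$. This is a perceptron-style update in which each example is weighted by $\sigma(-z_i) \in (0,1)$: confidently classified points contribute almost nothing, while misclassified points are added with weight close to one.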