We revisit the problem of learning a single neuron with ReLU activation under Gaussian input with the squared loss. We focus in particular on the over-parameterized setting, where the student network has $n\ge 2$ neurons. We prove global convergence of randomly initialized gradient descent at an $O\left(T^{-3}\right)$ rate. This is the first global convergence result for this problem beyond the exact-parameterization setting ($n=1$), in which gradient descent enjoys an $\exp(-\Omega(T))$ rate. Perhaps surprisingly, we further present an $\Omega\left(T^{-3}\right)$ lower bound for randomly initialized gradient flow in the over-parameterized setting. These two bounds together give an exact characterization of the convergence rate and imply, for the first time, that over-parameterization can exponentially slow down convergence. To prove global convergence, we must handle the interactions among student neurons in the gradient descent dynamics, which are absent in the exact-parameterization case; we analyze the dynamics through a three-phase structure. Along the way, we prove that gradient descent automatically balances the student neurons, and we use this property to deal with the non-smoothness of the objective function. To prove the convergence-rate lower bound, we construct a novel potential function that characterizes the pairwise distances between the student neurons (a quantity that has no analogue in the exact-parameterization case). We show that this potential function converges slowly, which in turn implies the slow convergence of the loss.
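Concretely (a minimal sketch of the standard single-ReLU-neuron formulation, assuming unit second-layer weights and population-level gradients; the notation $w^{\ast}$, $w_i$, $\sigma$, $\eta$ below is illustrative), the teacher computes $x \mapsto \sigma(w^{\ast\top} x)$ with $\sigma(z)=\max(z,0)$, the student computes $x \mapsto \sum_{i=1}^{n}\sigma(w_i^\top x)$, and gradient descent with step size $\eta$ is run on the population squared loss:
$$
L(w_1,\dots,w_n) \;=\; \frac{1}{2}\,\mathbb{E}_{x\sim\mathcal{N}(0,I_d)}\!\left[\Big(\sum_{i=1}^{n}\sigma(w_i^\top x)-\sigma(w^{\ast\top} x)\Big)^{2}\right],
\qquad
w_i^{(t+1)} \;=\; w_i^{(t)} - \eta\,\nabla_{w_i} L\big(w_1^{(t)},\dots,w_n^{(t)}\big).
$$
In this notation, the exact-parameterization case is $n=1$, where the loss decays as $\exp(-\Omega(T))$, whereas for $n\ge 2$ the matching upper and lower bounds above pin the rate at $\Theta\left(T^{-3}\right)$.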