Recently there has been significant theoretical progress on understanding the convergence and generalization of gradient-based methods on nonconvex losses with overparameterized models. Nevertheless, many aspects of optimization and generalization, and in particular the critical role of small random initialization, are not fully understood. In this paper, we take a step towards demystifying this role by proving that small random initialization followed by a few iterations of gradient descent behaves akin to popular spectral methods. We also show that this implicit spectral bias from small random initialization, which is provably more prominent for overparameterized models, puts the gradient descent iterations on a particular trajectory towards solutions that are not only globally optimal but also generalize well. Concretely, we focus on the problem of reconstructing a low-rank matrix from a few measurements via a natural nonconvex formulation. In this setting, we show that the trajectory of the gradient descent iterations from small random initialization can be approximately decomposed into three phases: (I) a spectral or alignment phase where we show that the iterates have an implicit spectral bias akin to spectral initialization, allowing us to show that at the end of this phase the column space of the iterates and the underlying low-rank matrix are sufficiently aligned, (II) a saddle avoidance/refinement phase where we show that the trajectory of the gradient iterates moves away from certain degenerate saddle points, and (III) a local refinement phase where we show that after avoiding the saddles the iterates converge quickly to the underlying low-rank matrix. Underlying our analysis are insights for the analysis of overparameterized nonconvex optimization schemes that may have implications for computational problems beyond low-rank reconstruction.
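To make the setting concrete, the following is a minimal, illustrative sketch (in Python/NumPy, with hypothetical dimensions, step size, and initialization scale not taken from the paper) of gradient descent from a small random initialization on one natural instance of the overparameterized nonconvex low-rank reconstruction loss described above, namely $f(U)=\frac{1}{4}\sum_i\big(\langle A_i, UU^\top\rangle - y_i\big)^2$.

```python
import numpy as np

# Minimal, illustrative sketch (hypothetical dimensions and hyperparameters):
# recover a rank-r PSD matrix X = Z Z^T from random linear measurements
# y_i = <A_i, X> by running gradient descent on the overparameterized
# nonconvex loss f(U) = (1/4) * sum_i (<A_i, U U^T> - y_i)^2,
# starting from a *small* random initialization U_0 = alpha * G.

rng = np.random.default_rng(0)
n, r, k, m = 30, 2, 10, 400            # ambient dim, true rank, overparam. width, #measurements

Z = rng.standard_normal((n, r)) / np.sqrt(n)
X_star = Z @ Z.T                       # ground-truth low-rank PSD matrix
A = rng.standard_normal((m, n, n)) / np.sqrt(m)
A = (A + A.transpose(0, 2, 1)) / 2     # symmetrized Gaussian measurement matrices
y = np.einsum('mij,ij->m', A, X_star)  # noiseless measurements y_i = <A_i, X_star>

alpha, eta, T = 1e-6, 0.2, 3000        # small init scale, step size, #iterations
U = alpha * rng.standard_normal((n, k))

for _ in range(T):
    residual = np.einsum('mij,ij->m', A, U @ U.T) - y
    grad = np.einsum('m,mij->ij', residual, A) @ U   # gradient of f with respect to U
    U -= eta * grad

print('relative reconstruction error:',
      np.linalg.norm(U @ U.T - X_star) / np.linalg.norm(X_star))
```

In this sketch the early iterations play the role of the spectral/alignment phase: while the iterates are small, the gradient is approximately $(UU^\top - X)U \approx -XU$, so the column space of $U$ aligns with the top eigenvectors of the underlying matrix before the later refinement phases take over.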