The dynamics of Deep Linear Networks (DLNs) is dramatically affected by the variance $\sigma^2$ of the parameters at initialization $\theta_0$. For DLNs of width $w$, we show a phase transition w.r.t. the scaling $\gamma$ of the variance $\sigma^2=w^{-\gamma}$ as $w\to\infty$: for large variance ($\gamma<1$), $\theta_0$ is very close to a global minimum but far from any saddle point, and for small variance ($\gamma>1$), $\theta_0$ is close to a saddle point and far from any global minimum. While the first case corresponds to the well-studied NTK regime, the second case is less understood. This motivates the study of the case $\gamma \to +\infty$, where we conjecture a Saddle-to-Saddle dynamics: throughout training, gradient descent visits the neighborhoods of a sequence of saddles, each corresponding to linear maps of increasing rank, until reaching a sparse global minimum. We support this conjecture with a theorem for the dynamics between the first two saddles, as well as some numerical experiments.
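As an illustration of the small-variance regime ($\gamma>1$) described above, the following minimal NumPy sketch (not the paper's code; the depth, width, value of $\gamma$, learning rate, and rank-3 target are illustrative assumptions) trains a three-layer linear network by gradient descent from an initialization with variance $w^{-\gamma}$ and prints the loss together with the top singular values of the end-to-end matrix. Loss plateaus during which the singular values emerge one at a time are the signature of the conjectured Saddle-to-Saddle dynamics.

```python
import numpy as np

# Hedged sketch of the small-initialization regime: all hyperparameters below
# are illustrative assumptions, not values taken from the paper.
rng = np.random.default_rng(0)
d, w, depth = 10, 50, 3            # input/output dimension, hidden width, number of layers
gamma = 1.5                        # small-variance regime (gamma > 1)
sigma = w ** (-gamma / 2)          # std so that Var(theta_0) = w^{-gamma}

# Rank-3 target linear map and Gaussian inputs.
U = rng.standard_normal((d, 3)); V = rng.standard_normal((3, d))
A_star = U @ V / np.sqrt(d)
X = rng.standard_normal((d, 500))
Y = A_star @ X

# Layer shapes d -> w -> ... -> w -> d, all entries i.i.d. N(0, w^{-gamma}).
shapes = [(w, d)] + [(w, w)] * (depth - 2) + [(d, w)]
Ws = [sigma * rng.standard_normal(s) for s in shapes]

def end_to_end(Ws):
    """Product W_L ... W_1 implemented by the network."""
    A = Ws[0]
    for W in Ws[1:]:
        A = W @ A
    return A

lr, n_steps = 1e-2, 30000
for step in range(n_steps):
    A = end_to_end(Ws)
    if step % 3000 == 0:
        loss = 0.5 * np.sum((A @ X - Y) ** 2) / X.shape[1]
        sv = np.linalg.svd(A, compute_uv=False)[:4]
        print(f"step {step:6d}  loss {loss:.4f}  top singular values {np.round(sv, 3)}")
    # Gradient of the loss w.r.t. the end-to-end matrix, then w.r.t. each layer:
    # dL/dW_i = (W_L ... W_{i+1})^T (dL/dA) (W_{i-1} ... W_1)^T.
    grad_A = ((A @ X - Y) / X.shape[1]) @ X.T
    grads = []
    for i in range(len(Ws)):
        left = end_to_end(Ws[i + 1:]) if i + 1 < len(Ws) else np.eye(d)
        right = end_to_end(Ws[:i]) if i > 0 else np.eye(d)
        grads.append(left.T @ grad_A @ right.T)
    for W, g in zip(Ws, grads):
        W -= lr * g
```

Under these (assumed) settings the printed loss typically decreases in steps, each drop accompanied by a new singular value of the end-to-end matrix leaving zero, consistent with gradient descent passing near a sequence of saddles of increasing rank.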