In deep linear networks (DLNs), various hyperparameters dramatically alter the training dynamics. We investigate how the rank of the linear map found by gradient descent is affected by (1) the norm of the initialization and (2) the addition of $L_{2}$ regularization on the parameters. For (1), we study two regimes: (1a) the linear/lazy regime, for large initialization norm; (1b) a ``saddle-to-saddle'' regime, for small initialization norm. In setting (1a), the dynamics of a DLN of any depth are similar to those of a standard linear model, without any low-rank bias. In setting (1b), we conjecture that throughout training, gradient descent visits a sequence of saddles, each corresponding to a linear map of increasing rank, until it reaches a global minimum of minimal rank. We support this conjecture with a partial proof and numerical experiments. For (2), we show that adding $L_{2}$ regularization on the parameters is equivalent to adding an $L_{p}$-Schatten (quasi)norm penalty on the linear map to the cost, with $p=\frac{2}{L}$ for a depth-$L$ network, leading to a stronger low-rank bias as $L$ grows. The effect of $L_{2}$ regularization on the loss surface also depends on the depth: for shallow networks, all critical points are either strict saddles or global minima, whereas for deep networks, some local minima appear. We numerically observe that these local minima can generalize better than global ones in some settings.
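As an illustrative sketch of the claimed correspondence for (2), with notation introduced here for exposition ($W_1,\dots,W_L$ the layer weight matrices of a depth-$L$ DLN, $A = W_L \cdots W_1$ the induced linear map, $\sigma_i(A)$ its singular values, and $\lambda$ the regularization strength), the $L_{2}$ penalty on the parameters, minimized over all factorizations of a fixed linear map $A$, reduces to a Schatten quasi-norm penalty on $A$; this is a standard factorization identity consistent with the abstract's claim, and the precise statement in the paper may differ:
\[
\min_{W_L \cdots W_1 = A}\; \lambda \sum_{l=1}^{L} \lVert W_l \rVert_F^2
\;=\; \lambda\, L \sum_{i} \sigma_i(A)^{2/L}
\;=\; \lambda\, L\, \lVert A \rVert_{2/L}^{2/L}.
\]
Since $p = \frac{2}{L} \le 1$ for $L \ge 2$, the penalty becomes increasingly concave in the singular values as the depth grows, which is consistent with the stronger low-rank bias described above.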