The implicit biases of gradient-based optimization algorithms are conjectured to be a major factor in the success of modern deep learning. In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations when the training data are nearly-orthogonal, a common property of high-dimensional data. For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that asymptotically, gradient flow produces a neural network with rank at most two. Moreover, this network is an $\ell_2$-max-margin solution (in parameter space), and has a linear decision boundary that corresponds to an approximate-max-margin linear predictor. For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training. We provide experiments which suggest that a small initialization scale is important for finding low-rank neural networks with gradient descent.
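As an illustration of the rank-reduction claim for gradient descent, here is a minimal sketch (not the paper's experimental setup): it trains a two-layer leaky ReLU network with a small random initialization by full-batch gradient descent on the logistic loss, using nearly-orthogonal inputs (high-dimensional Gaussians) with random labels, and tracks the stable rank of the first-layer weight matrix as a soft proxy for rank. All names and hyperparameters (n, d, m, alpha, init_scale, lr, stable_rank) are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 20, 1000, 50        # n samples in d dimensions (d >> n), m hidden units
alpha = 0.1                   # leaky ReLU negative slope
init_scale = 1e-4             # small initialization scale for the first layer
lr = 1.0                      # step size (illustrative)

# High-dimensional Gaussian inputs are nearly orthogonal with high probability.
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = rng.choice([-1.0, 1.0], size=n)

W = init_scale * rng.standard_normal((m, d))      # trained first-layer weights
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)  # fixed second-layer weights

def leaky_relu(z):
    return np.where(z > 0, z, alpha * z)

def stable_rank(M):
    # Soft proxy for rank: ||M||_F^2 / ||M||_2^2.
    s = np.linalg.svd(M, compute_uv=False)
    return float(np.sum(s ** 2) / s[0] ** 2)

def gd_step(W):
    Z = X @ W.T                              # pre-activations, shape (n, m)
    f = leaky_relu(Z) @ a                    # network outputs, shape (n,)
    g = -y / (1.0 + np.exp(y * f))           # d(logistic loss)/d(output)
    dphi = np.where(Z > 0, 1.0, alpha)       # leaky ReLU derivative at Z
    gradW = ((g[:, None] * dphi) * a[None, :]).T @ X / n
    return W - lr * gradW

print(f"stable rank at init:         {stable_rank(W):.1f}")
W = gd_step(W)
print(f"stable rank after one step:  {stable_rank(W):.1f}")
for _ in range(200):
    W = gd_step(W)
print(f"stable rank after 201 steps: {stable_rank(W):.1f}")
```

Under these illustrative settings one typically observes the stable rank drop from several tens at initialization to roughly two after the first step and stay small thereafter; increasing init_scale weakens the effect, consistent with the abstract's remark that a small initialization scale matters for finding low-rank networks.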