Understanding the implicit bias of gradient descent and its role in the generalization capability of ReLU networks has been an important topic in machine learning research. Unfortunately, even for a single ReLU neuron trained with the square loss, it was recently shown that the implicit regularization cannot be characterized in terms of a norm of the model parameters (Vardi & Shamir, 2021). To close this gap toward understanding the intriguing generalization behavior of ReLU networks, we examine the gradient flow dynamics in the parameter space when training single-neuron ReLU networks. Specifically, we discover an implicit bias in terms of support vectors, which plays a key role in why and how ReLU networks generalize well. Moreover, we analyze gradient flows with respect to the magnitude of the norm of the initialization, and show that the norm of the learned weight strictly increases along the gradient flow. Lastly, we prove global convergence of a single ReLU neuron in the $d = 2$ case.
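As a minimal numerical sketch of the setting described above, the snippet below approximates gradient flow by small-step gradient descent on a single ReLU neuron trained with the square loss on synthetic, teacher-generated data in $d = 2$, and monitors the norm of the weight along the trajectory. The teacher vector, data distribution, step size, and initialization scale are illustrative assumptions, not choices taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic realizable data in d = 2: targets come from a hypothetical "teacher" ReLU neuron.
d, n = 2, 200
X = rng.standard_normal((n, d))
v_teacher = np.array([1.0, 0.5])          # illustrative teacher weights (assumption)
y = np.maximum(X @ v_teacher, 0.0)        # y_i = relu(v^T x_i)

def loss_and_grad(w, X, y):
    """Square loss of a single ReLU neuron and a (sub)gradient with respect to w."""
    pre = X @ w
    resid = np.maximum(pre, 0.0) - y
    loss = 0.5 * np.mean(resid ** 2)
    # d/dw relu(w^T x) = 1{w^T x > 0} * x  (choosing 0 as the subgradient at the kink)
    grad = (resid * (pre > 0)) @ X / len(y)
    return loss, grad

# Gradient flow approximated by gradient descent with a small step size,
# starting from a small-norm initialization.
w = 0.05 * rng.standard_normal(d)
eta, steps = 1e-2, 20000
norms = []
for _ in range(steps):
    loss, grad = loss_and_grad(w, X, y)
    w -= eta * grad
    norms.append(np.linalg.norm(w))

print(f"final loss      = {loss:.3e}")
print(f"||w||: start {norms[0]:.4f} -> end {norms[-1]:.4f}")
print("norm non-decreasing along trajectory:",
      all(b >= a - 1e-12 for a, b in zip(norms, norms[1:])))
```

A sufficiently small step size is used here as a stand-in for the continuous-time flow; the monotonicity check is only a numerical sanity check of the norm-growth claim, not a substitute for the proof.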