We show that gradient descent on an unregularized logistic regression problem, for linearly separable datasets, converges to the direction of the max-margin (hard-margin SVM) solution. The result also generalizes to other monotone decreasing loss functions with an infimum at infinity, to multi-class problems, and to training a weight layer in a deep network in a certain restricted setting. Furthermore, we show this convergence is very slow, and only logarithmic in the convergence of the loss itself. This can help explain the benefit of continuing to optimize the logistic or cross-entropy loss even after the training error is zero and the training loss is extremely small, and, as we show, even if the validation loss increases. Our methodology can also aid in understanding implicit regularization in more complex models and with other optimization methods.
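As a rough illustration of the main claim, the sketch below (not the paper's code; the toy dataset, step size, and the use of scikit-learn's LinearSVC with a very large C as a stand-in for the hard-margin SVM are all assumptions) runs plain gradient descent on the unregularized logistic loss over a separable dataset: the norm of the iterate keeps growing as the loss tends to zero, while its direction approaches the max-margin direction.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Illustrative sketch: gradient descent on the unregularized logistic loss
# for a linearly separable 2-D dataset, compared with the hard-margin SVM
# direction (approximated by LinearSVC with a very large C).
rng = np.random.default_rng(0)
n = 50
X = np.vstack([rng.normal([1.0, 1.0], 0.3, size=(n, 2)),
               rng.normal([-1.0, -1.0], 0.3, size=(n, 2))])
y = np.hstack([np.ones(n), -np.ones(n)])

def grad_logistic(w):
    # Gradient of sum_i log(1 + exp(-y_i w^T x_i)).
    margins = y * (X @ w)
    return -(X * (y / (1.0 + np.exp(margins)))[:, None]).sum(axis=0)

w = np.zeros(2)
lr = 0.01  # small, fixed step size
for t in range(200_000):
    w -= lr * grad_logistic(w)

svm = LinearSVC(C=1e6, fit_intercept=False, max_iter=100_000).fit(X, y)
w_svm = svm.coef_.ravel()

print("GD  direction:", w / np.linalg.norm(w))
print("SVM direction:", w_svm / np.linalg.norm(w_svm))
print("||w|| after GD:", np.linalg.norm(w))  # keeps growing as the loss -> 0
```

Under these assumptions the two printed directions should nearly coincide, while running GD for, say, ten times as many iterations changes the direction only slightly, consistent with the logarithmically slow convergence described above.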