The early phase of training of deep neural networks has a dramatic effect on the local curvature of the loss function. For instance, using a small learning rate does not guarantee stable optimization because the optimization trajectory has a tendency to steer towards regions of the loss surface with increasing local curvature. We ask whether this tendency is connected to the widely observed phenomenon that the choice of the learning rate strongly influences generalization. We first show that stochastic gradient descent (SGD) implicitly penalizes the trace of the Fisher Information Matrix (FIM), a measure of the local curvature, from the beginning of training. We argue it is an implicit regularizer in SGD by showing that explicitly penalizing the trace of the FIM can significantly improve generalization. We highlight that poor final generalization coincides with the trace of the FIM increasing to a large value early in training, a phenomenon which we refer to as catastrophic Fisher explosion. Finally, to gain insight into the regularization effect of penalizing the trace of the FIM, we show that it limits memorization by reducing the learning speed of examples with noisy labels more than that of examples with clean labels.
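To make the explicit penalty concrete, below is a minimal PyTorch sketch of a Fisher-trace regularizer added to a standard classification objective. It uses the standard Monte Carlo estimator of Tr(FIM), sampling one label per input from the model's own predictive distribution and penalizing the squared gradient norm of the resulting log-likelihood. Function names such as fisher_trace_penalty and the coefficient penalty_coef are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def fisher_trace_penalty(model, inputs):
    """Monte Carlo estimate of Tr(FIM) on a mini-batch.

    Tr(F) = E_{x ~ data, y ~ p_theta(.|x)} || grad_theta log p_theta(y|x) ||^2,
    estimated here with a single sampled label per input.
    """
    logits = model(inputs)
    # Sample "pseudo-labels" from the model's own predictive distribution.
    with torch.no_grad():
        sampled = torch.distributions.Categorical(logits=logits).sample()
    log_lik = -F.cross_entropy(logits, sampled)
    params = [p for p in model.parameters() if p.requires_grad]
    # create_graph=True so the penalty itself can be back-propagated through.
    grads = torch.autograd.grad(log_lik, params, create_graph=True)
    return sum((g ** 2).sum() for g in grads)


def training_step(model, optimizer, inputs, targets, penalty_coef=0.1):
    """One SGD step on the task loss plus the explicit Fisher-trace penalty."""
    optimizer.zero_grad()
    loss = F.cross_entropy(model(inputs), targets)
    loss = loss + penalty_coef * fisher_trace_penalty(model, inputs)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the penalty depends on parameter gradients, optimizing it requires second-order differentiation (hence create_graph=True), which roughly doubles the cost of a training step in this sketch.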