In recent years, Softmax has become a standard component of neural network architectures. In this paper, a gradient decay hyperparameter is introduced into Softmax to control the probability-dependent gradient decay rate during training. Through theoretical analysis and empirical results on a variety of model architectures trained on MNIST, CIFAR-10/100, and SVHN, we find that generalization performance depends significantly on how the gradient decays as the confidence probability rises, i.e., on whether the gradient decreases convexly or concavely as the sample probability increases. Moreover, optimization with a small gradient decay exhibits a curriculum-learning-like sequence in which hard samples receive attention only after easy samples have been fitted with sufficient confidence, and well-separated samples receive larger gradients that further reduce intra-class distance. Based on these results, we provide evidence that large-margin Softmax affects the local Lipschitz constraint of the loss function by regulating the probability-dependent gradient decay rate. This paper thus offers a new perspective on the relationship among large-margin Softmax, the local Lipschitz constraint, and curriculum learning through the lens of the gradient decay rate. In addition, we propose a warm-up strategy that dynamically adjusts the Softmax loss during training, increasing the gradient decay rate from an initially small value to accelerate convergence.
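For context, the gradient of the standard softmax cross-entropy loss with respect to the target logit already decays as the predicted probability approaches one; the sketch below illustrates, using an assumed hyperparameter $\beta$ and notation of our own choosing (not necessarily the paper's exact formulation), how such a hyperparameter could reshape this decay curve:

\[
\frac{\partial \mathcal{L}_{\mathrm{CE}}}{\partial z_y} \;=\; p_y - 1 \;=\; -(1 - p_y),
\qquad
p_y = \frac{e^{z_y}}{\sum_j e^{z_j}},
\qquad
\frac{\partial \mathcal{L}_{\beta}}{\partial z_y} \;=\; -(1 - p_y)^{\beta}.
\]

Here $\beta = 1$ recovers the standard gradient, $\beta > 1$ makes the gradient magnitude vanish convexly in $p_y$ (high-confidence samples are suppressed quickly), and $0 < \beta < 1$ makes it vanish concavely (well-fitted samples retain a comparatively large gradient), which is consistent with the small-gradient-decay behavior described above.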