Learning rate adaptation is a popular topic in machine learning. Standard gradient descent trains a neural network with a fixed learning rate. Learning rate adaptation has been proposed to accelerate training by adjusting the step size during the training session. Well-known methods include Momentum, Adam and Hypergradient. Hypergradient stands out among these: it achieves adaptation by computing the derivative of the cost function with respect to the learning rate and applying gradient descent to the learning rate itself. However, Hypergradient is still not perfect. In practice, Hypergradient fails with high probability to decrease the training loss after the learning rate has been adapted. In addition, there is evidence that Hypergradient is not suitable for large datasets trained with minibatches. Most unfortunately, Hypergradient consistently fails to reach good accuracy on the validation dataset, even though it can reduce the training loss to a very small value. To solve these problems, we propose a novel adaptation algorithm in which the learning rate is parameter-specific and internally structured. We conduct extensive experiments on multiple network models and datasets and compare against various benchmark optimizers. The results show that our algorithm achieves faster and higher-quality convergence than those state-of-the-art optimizers.
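As an illustration of the Hypergradient baseline described above, and not of the algorithm proposed in this paper, the following minimal Python sketch applies gradient descent to a single scalar learning rate. The function names, the toy quadratic loss, and the values of `alpha`, `beta` and `steps` are assumptions chosen only for the example. It relies on the standard observation that, for plain gradient descent, the derivative of the current loss with respect to the learning rate is the negative dot product of the current and previous gradients.

```python
def hypergradient_descent(grad_fn, theta, alpha=0.01, beta=1e-4, steps=100):
    """Gradient descent with a hypergradient-adapted scalar learning rate.

    Illustrative sketch only: grad_fn, alpha, beta and steps are
    assumptions made for this example, not values from the paper.
    """
    prev_grad = [0.0] * len(theta)
    alphas = []
    for _ in range(steps):
        grad = grad_fn(theta)
        # d(loss)/d(alpha) = -(grad . prev_grad), so a descent step on
        # alpha adds beta * (grad . prev_grad).
        alpha += beta * sum(g * p for g, p in zip(grad, prev_grad))
        # Ordinary parameter update using the freshly adapted step size.
        theta = [t - alpha * g for t, g in zip(theta, grad)]
        prev_grad = grad
        alphas.append(alpha)
    return theta, alphas


def quadratic_grad(theta):
    # Gradient of the toy loss f(theta) = 0.5 * sum(theta_i ** 2).
    return list(theta)


theta, alphas = hypergradient_descent(quadratic_grad, [1.0, -2.0])
```

On this toy problem consecutive gradients point in the same direction, so the learning rate grows from step to step. The sketch uses one global learning rate, whereas the method proposed here makes the learning rate parameter-specific.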