This work is superseded by the paper in arXiv:2011.14066. Stochastic gradient descent (SGD) is the de facto algorithm for training deep neural networks (DNNs). Despite its popularity, it still requires fine-tuning to achieve its best performance. This has led to the development of adaptive methods, which claim to automate hyper-parameter tuning. Recently, researchers have studied both algorithmic classes via toy examples: e.g., for over-parameterized linear regression, Wilson et al. (2017) show that, while SGD always converges to the minimum-norm solution, adaptive methods show no such inclination, leading to worse generalization. Our aim is to study this conjecture further. We empirically show that the minimum weight norm is not necessarily the proper gauge of good generalization in simplified scenarios, and that the different models found by adaptive methods can outperform those of plain gradient methods. In practical DNN settings, we observe that adaptive methods can outperform SGD, producing models with larger weight norms, but without necessarily reducing the amount of tuning required.
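For context, the following is a minimal sketch (not taken from the paper) of the over-parameterized linear-regression setting referenced above: plain gradient descent initialized at zero converges to the minimum-norm interpolating solution, while a diagonal-adaptive (Adagrad-style) update generally lands elsewhere. The dimensions, step sizes, and iteration counts are illustrative assumptions, not the paper's experimental setup.

```python
# Illustrative sketch: GD vs. an Adagrad-style update on over-parameterized
# least squares. All hyper-parameters here are arbitrary choices for the demo.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                          # fewer samples than parameters
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

def grad(w):
    # Gradient of the loss 0.5/n * ||Xw - y||^2
    return X.T @ (X @ w - y) / n

# Minimum-norm interpolating solution, via the pseudo-inverse.
w_min_norm = np.linalg.pinv(X) @ y

# Plain gradient descent from zero initialization.
w_gd = np.zeros(d)
for _ in range(50_000):
    w_gd -= 0.05 * grad(w_gd)

# Adagrad-style diagonal adaptive update from the same initialization.
w_ada, accum = np.zeros(d), np.zeros(d)
for _ in range(50_000):
    g = grad(w_ada)
    accum += g * g
    w_ada -= 0.05 * g / (np.sqrt(accum) + 1e-8)

print("GD distance to min-norm solution:     ", np.linalg.norm(w_gd - w_min_norm))
print("Adagrad distance to min-norm solution:", np.linalg.norm(w_ada - w_min_norm))
print("weight norms (GD, Adagrad, min-norm): ",
      np.linalg.norm(w_gd), np.linalg.norm(w_ada), np.linalg.norm(w_min_norm))
```

In this sketch, the gradient-descent iterates stay in the row space of X (each update is a linear combination of the rows), which is why they approach the pseudo-inverse solution; the coordinate-wise scaling of the adaptive update pulls its iterates off that subspace, so it typically reaches a different interpolating solution with a larger norm.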