Descent methods for deep networks are notoriously capricious: they require careful tuning of step size, momentum and weight decay, and which method will work best on a new benchmark is a priori unclear. To address this problem, this paper conducts a combined study of neural architecture and optimisation, leading to a new optimiser called Nero: the neuronal rotator. Nero trains reliably without momentum or weight decay, works in situations where Adam and SGD fail, and requires little to no learning rate tuning. Moreover, Nero's memory footprint is roughly the square root of that of Adam or LAMB. Nero combines two ideas: (1) projected gradient descent over the space of balanced networks; (2) neuron-specific updates, where the step size sets the angle through which each neuron's hyperplane turns. The paper concludes by discussing how this geometric connection between architecture and optimisation may impact theories of generalisation in deep learning.
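To make the two ideas concrete, here is a minimal, hypothetical sketch in NumPy of a Nero-style update, not the paper's exact algorithm: each row of a weight matrix is treated as one neuron, the gradient step is normalised per neuron (so the learning rate roughly controls the turning angle of that neuron's hyperplane), and the result is projected back onto a "balanced" constraint set assumed here to mean zero mean and unit norm per neuron. The function names, the running-average bookkeeping and all hyperparameter values are illustrative assumptions.

```python
import numpy as np

def project_balanced(W):
    """Project each neuron's weight vector (row of W) onto an assumed
    'balanced' constraint set: zero mean and unit norm per neuron."""
    W = W - W.mean(axis=1, keepdims=True)
    return W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-12)

def nero_like_step(W, grad, state, lr=0.01, beta=0.999, eps=1e-8):
    """One hypothetical Nero-style update combining the abstract's two ideas:
    (1) a neuron-specific normalised gradient step, so the step size sets the
        angle through which each neuron's hyperplane turns (approximately);
    (2) projected gradient descent back onto the balanced constraint set."""
    # Running average of each neuron's squared gradient norm (assumed bookkeeping).
    g_sq = np.sum(grad**2, axis=1, keepdims=True)
    state['v'] = beta * state.get('v', np.zeros_like(g_sq)) + (1 - beta) * g_sq
    # Per-neuron normalised step.
    W = W - lr * grad / (np.sqrt(state['v']) + eps)
    # Return to the constraint set.
    return project_balanced(W)

# Toy usage on a single weight matrix (rows = neurons, columns = inputs).
rng = np.random.default_rng(0)
W = project_balanced(rng.standard_normal((4, 8)))
state = {}
for _ in range(5):
    grad = rng.standard_normal(W.shape)   # stand-in for a backprop gradient
    W = nero_like_step(W, grad, state)
print(np.linalg.norm(W, axis=1))          # each neuron stays unit-norm
```

Note that the per-neuron state here is a single scalar per row rather than one value per weight, which is consistent with the abstract's claim that the memory footprint is roughly the square root of that of Adam or LAMB.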