In contrast to SGD, adaptive gradient methods like Adam allow robust training of modern deep networks, especially large language models. However, the use of adaptivity not only comes at the cost of extra memory but also raises a fundamental question: can non-adaptive methods like SGD enjoy similar benefits? In this paper, we provide an affirmative answer to this question by proposing to achieve both robust and memory-efficient training via the following general recipe: (1) modify the architecture to make it scale invariant, i.e., the scale of the parameters does not affect the output of the network, (2) train with SGD and weight decay, and optionally (3) clip the global gradient norm at a threshold proportional to the weight norm, namely the weight norm multiplied by $\sqrt{\tfrac{2\lambda}{\eta}}$, where $\eta$ is the learning rate and $\lambda$ is the weight decay coefficient. We show that this general approach is robust to rescaling of parameters and loss by proving that its convergence depends only logarithmically on the scale of initialization and loss, whereas standard SGD might not even converge for many initializations. Following our recipe, we design a scale invariant version of BERT, called SIBERT, which, when trained with vanilla SGD alone, achieves performance on downstream tasks comparable to BERT trained with adaptive methods like Adam.
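To make the optimizer side of the recipe concrete, here is a minimal sketch, assuming PyTorch, of steps (2) and (3): an SGD update with weight decay in which the global gradient norm is clipped at the weight norm multiplied by $\sqrt{2\lambda/\eta}$, as stated in the abstract. The function name `sgd_step_with_relative_clipping` and the variable names `model`, `eta`, and `lam` are illustrative, not from the paper's code.

```python
# Illustrative sketch (not the official implementation) of recipe steps (2)-(3):
# SGD with weight decay, clipping the global gradient norm at
# ||w|| * sqrt(2 * lambda / eta).
import math
import torch

def sgd_step_with_relative_clipping(model, eta=0.1, lam=1e-4):
    params = [p for p in model.parameters() if p.grad is not None]
    # Global weight norm and global gradient norm over all parameters.
    weight_norm = math.sqrt(sum(p.detach().pow(2).sum().item() for p in params))
    grad_norm = math.sqrt(sum(p.grad.detach().pow(2).sum().item() for p in params))
    # Clipping threshold proportional to the weight norm: ||w|| * sqrt(2*lam/eta).
    threshold = weight_norm * math.sqrt(2.0 * lam / eta)
    scale = min(1.0, threshold / (grad_norm + 1e-12))
    with torch.no_grad():
        for p in params:
            # SGD update with weight decay: p <- p - eta * (clipped_grad + lam * p)
            p.add_(scale * p.grad + lam * p, alpha=-eta)
```

Step (1), making the architecture scale invariant, is an architectural change (yielding SIBERT in the paper) and is not shown here.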