The Transformer is widely used in natural language processing tasks. To train a Transformer, however, one usually needs a carefully designed learning rate warm-up stage, which is shown to be crucial to the final performance but slows down optimization and requires more hyper-parameter tuning. In this paper, we first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters. Specifically, we prove with mean field theory that at initialization, for the originally designed Post-LN Transformer, which places the layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large. Therefore, using a large learning rate on those gradients makes the training unstable. The warm-up stage is practically helpful for avoiding this problem. On the other hand, our theory also shows that if the layer normalization is put inside the residual blocks (recently proposed as the Pre-LN Transformer), the gradients are well-behaved at initialization. This motivates us to remove the warm-up stage for the training of Pre-LN Transformers. We show in our experiments that Pre-LN Transformers without the warm-up stage can reach results comparable to the baselines while requiring significantly less training time and hyper-parameter tuning on a wide range of applications.
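To make the architectural difference concrete, the following PyTorch-style sketch contrasts the two layer-norm placements described above. It is an illustrative sketch, not the authors' implementation; the module and argument names (`d_model`, `sublayer`) are assumptions.

```python
import torch.nn as nn


class PostLNBlock(nn.Module):
    """Post-LN sub-layer (original design): x -> LayerNorm(x + Sublayer(x))."""

    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer          # e.g. self-attention or feed-forward module
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # Layer normalization sits *between* the residual blocks (after the add).
        return self.norm(x + self.sublayer(x))


class PreLNBlock(nn.Module):
    """Pre-LN sub-layer: x -> x + Sublayer(LayerNorm(x))."""

    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # Layer normalization sits *inside* the residual branch,
        # leaving an unnormalized identity path from input to output.
        return x + self.sublayer(self.norm(x))
```

In the Post-LN block, normalization is applied to the residual sum itself, whereas in the Pre-LN block the identity path bypasses the normalization, which is the placement our analysis shows yields well-behaved gradients at initialization.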