In this paper, we propose a simple yet effective method to stabilize extremely deep Transformers. Specifically, we introduce a new normalization function (DeepNorm) to modify the residual connection in the Transformer, accompanied by a theoretically derived initialization. In-depth theoretical analysis shows that model updates can be bounded in a stable way. The proposed method combines the best of two worlds, i.e., the good performance of Post-LN and the stable training of Pre-LN, making DeepNorm a preferred alternative. We successfully scale Transformers up to 1,000 layers (i.e., 2,500 attention and feed-forward network sublayers) without difficulty, which is one order of magnitude deeper than previous deep Transformers. Remarkably, on a multilingual benchmark with 7,482 translation directions, our 200-layer model with 3.2B parameters significantly outperforms the 48-layer state-of-the-art model with 12B parameters by 5 BLEU points, which indicates a promising scaling direction.
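To make the idea concrete, below is a minimal PyTorch sketch of a DeepNorm-style residual block, assuming the update takes the form x_{l+1} = LayerNorm(alpha * x + G(x)), where alpha up-weights the residual branch and the sublayer weights are scaled by a constant beta at initialization. The sublayer here is a plain feed-forward network standing in for the attention/FFN sublayers, and the alpha/beta expressions follow the paper's reported decoder-only constants; treat the whole block as an illustration rather than the exact reference implementation.

```python
import torch
import torch.nn as nn


class DeepNormResidual(nn.Module):
    """DeepNorm-style residual block: x_{l+1} = LayerNorm(alpha * x + G(x)).

    Illustrative constants (decoder-only setting as reported in the paper):
    alpha = (2N)^(1/4), beta = (8N)^(-1/4), with N the total number of layers.
    In the full method, the beta scaling targets the FFN weights and the
    attention value/output projections; here we scale the toy FFN sublayer.
    """

    def __init__(self, d_model: int, num_layers: int):
        super().__init__()
        self.alpha = (2 * num_layers) ** 0.25
        beta = (8 * num_layers) ** -0.25

        # G(x): a simple feed-forward sublayer standing in for attention/FFN.
        self.sublayer = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm = nn.LayerNorm(d_model)

        # Theoretically derived initialization: shrink the sublayer's
        # initial weights by beta so early updates stay bounded.
        with torch.no_grad():
            for m in self.sublayer:
                if isinstance(m, nn.Linear):
                    m.weight.mul_(beta)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Post-LN placement, but with the residual branch amplified by alpha.
        return self.norm(self.alpha * x + self.sublayer(x))


if __name__ == "__main__":
    # Stack a few blocks; the same constants would be used for a 1,000-layer model.
    depth = 8
    blocks = nn.Sequential(*[DeepNormResidual(d_model=64, num_layers=depth) for _ in range(depth)])
    out = blocks(torch.randn(2, 16, 64))
    print(out.shape)  # torch.Size([2, 16, 64])
```

The design point is that the normalization stays in the Post-LN position (preserving its performance), while the alpha-scaled residual and beta-scaled initialization bound the magnitude of model updates, which is what allows the depth to grow to 1,000 layers without divergence.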