We explore the application of very deep Transformer models for Neural Machine Translation (NMT). Using a simple yet effective initialization technique that stabilizes training, we show that it is feasible to build standard Transformer-based models with up to 60 encoder layers and 12 decoder layers. These deep models outperform their baseline 6-layer counterparts by as much as 2.5 BLEU, and achieve new state-of-the-art benchmark results on WMT14 English-French (43.8 BLEU and 46.4 BLEU with back-translation) and WMT14 English-German (30.1 BLEU). The code and trained models will be publicly available at: https://github.com/namisan/exdeep-nmt.
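To make the scale described above concrete, here is a minimal sketch that instantiates a Transformer at the reported depth (60 encoder layers, 12 decoder layers) using the standard PyTorch module. It does not implement the authors' stabilizing initialization technique, and all hyperparameters other than the layer counts are illustrative assumptions rather than values from the paper.

```python
import torch
import torch.nn as nn

# Sketch only: a Transformer at the depth reported in the abstract.
# Dimensions, heads, FFN size, and dropout are assumed defaults, not
# the paper's settings; the paper's initialization scheme is omitted.
model = nn.Transformer(
    d_model=512,              # assumed model dimension
    nhead=8,                  # assumed number of attention heads
    num_encoder_layers=60,    # encoder depth from the abstract
    num_decoder_layers=12,    # decoder depth from the abstract
    dim_feedforward=2048,     # assumed feed-forward size
    dropout=0.1,
    batch_first=True,
)

# Dummy source/target batches to confirm a forward pass runs.
src = torch.rand(2, 10, 512)
tgt = torch.rand(2, 7, 512)
out = model(src, tgt)
print(out.shape)  # torch.Size([2, 7, 512])
```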