The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
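Since the abstract names the attention mechanism without defining it, the following is a minimal NumPy sketch of scaled dot-product attention, the core operation the Transformer builds on; the function name, shapes, and toy data are illustrative assumptions, not taken from the abstract itself.

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

        Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
        """
        d_k = Q.shape[-1]
        # Similarity of each query to each key, scaled by sqrt(d_k)
        scores = Q @ K.T / np.sqrt(d_k)
        # Row-wise softmax over the keys (numerically stabilized)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        # Each output is a weighted sum of the value vectors
        return weights @ V

    # Toy usage: 3 queries attending over 4 key/value pairs of dimension 8.
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((3, 8))
    K = rng.standard_normal((4, 8))
    V = rng.standard_normal((4, 8))
    out = scaled_dot_product_attention(Q, K, V)  # shape (3, 8)

In the full model this operation is applied in parallel across multiple heads and stacked in both the encoder and decoder, which is what allows the architecture to dispense with recurrence and convolutions.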