The prevalent approach to sequence to sequence learning maps an input sequence to a variable-length output sequence via recurrent neural networks. We introduce an architecture based entirely on convolutional neural networks. Compared to recurrent models, computations over all elements can be fully parallelized during training, and optimization is easier since the number of non-linearities is fixed and independent of the input length. Our use of gated linear units eases gradient propagation, and we equip each decoder layer with a separate attention module. We outperform the accuracy of the deep LSTM setup of Wu et al. (2016) on both WMT'14 English-German and WMT'14 English-French translation, while being an order of magnitude faster on both GPU and CPU.
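To make the gated linear unit concrete, the sketch below shows a 1-D convolution whose output is split in half, with one half gating the other through a sigmoid. This is a minimal illustration only, assuming PyTorch; the layer sizes, class name, and padding choice are ours and not taken from the authors' released code.

```python
import torch
import torch.nn as nn


class GLUConv1d(nn.Module):
    """Illustrative 1-D convolution followed by a gated linear unit:
    the convolution emits 2 * out_channels; one half is passed through
    a sigmoid and gates the other half element-wise."""

    def __init__(self, in_channels: int, out_channels: int, kernel_size: int = 3):
        super().__init__()
        # "Same" padding so the sequence length is preserved (a simplifying assumption).
        self.conv = nn.Conv1d(in_channels, 2 * out_channels,
                              kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, in_channels, seq_len).
        a, b = self.conv(x).chunk(2, dim=1)  # split channels into value and gate
        return a * torch.sigmoid(b)          # GLU(a, b) = a * sigmoid(b)


# Usage sketch: a batch of 8 sequences with 16 input channels and length 20.
layer = GLUConv1d(in_channels=16, out_channels=32)
out = layer(torch.randn(8, 16, 20))          # -> shape (8, 32, 20)
```

Because the gate is a plain element-wise product with a sigmoid, the linear path `a` carries gradients without the saturating non-linearity applied to it, which is the property the abstract refers to when it says gated linear units ease gradient propagation.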