Recurrent models dominated the field of neural machine translation (NMT) for several years. The Transformer \citep{vaswani2017attention} radically changed this by introducing a novel architecture that relies on a feed-forward backbone and a self-attention mechanism. Although Transformers are powerful, they can fail to properly encode sequential/positional information due to their non-recurrent nature. To address this problem, position embeddings are defined separately for each time step to enrich word representations. However, such embeddings are fixed after training, regardless of the task and the word-ordering system of the source or target language. In this paper, we propose a novel architecture with position embeddings that depend on the input text, addressing this shortcoming by taking the order of target words into consideration. Instead of using predefined position embeddings, our solution \textit{generates} new embeddings to refine each word's position information. Since we do not dictate the position of source tokens but instead learn it in an end-to-end fashion, we refer to our method as \textit{dynamic} position encoding (DPE). We evaluated the impact of our model on multiple datasets, translating from English into German, French, and Italian, and observed meaningful improvements over the original Transformer.
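To make the contrast concrete, the sketch below places the standard fixed sinusoidal position encoding of \citep{vaswani2017attention} next to a hypothetical input-conditioned position encoder. The \texttt{DynamicPositionEncoder} module, its two-layer feed-forward refinement, and the additive combination are illustrative assumptions chosen for exposition only; they are not the exact DPE architecture proposed in the paper.

\begin{verbatim}
# Minimal sketch: fixed sinusoidal position encodings vs. a hypothetical
# input-conditioned ("dynamic") position encoder. The DynamicPositionEncoder
# below is an illustrative assumption, not the paper's exact DPE design.
import math
import torch
import torch.nn as nn


def sinusoidal_positions(seq_len: int, d_model: int) -> torch.Tensor:
    """Fixed sinusoidal encodings (Vaswani et al., 2017); assumes even d_model.
    The values depend only on the position index, never on the input text."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe


class DynamicPositionEncoder(nn.Module):
    """Hypothetical sketch: refine the fixed encoding with a correction that
    is computed from the token embeddings themselves and learned end-to-end,
    so positional information can adapt to the input sentence."""

    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )

    def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
        seq_len, d_model = token_embeddings.shape[-2:]
        fixed = sinusoidal_positions(seq_len, d_model).to(token_embeddings.device)
        return token_embeddings + fixed + self.proj(token_embeddings)


# Example usage:
#   x = torch.randn(2, 10, 512)            # (batch, sequence, model dim)
#   out = DynamicPositionEncoder(512)(x)   # same shape, position-enriched
\end{verbatim}

In a full NMT model, a module of this kind would sit at the encoder or decoder input in place of the purely additive sinusoidal term, so that the positional signal is trained jointly with the translation objective rather than held fixed after training.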