Directly training a document-to-document (Doc2Doc) neural machine translation (NMT) model with a Transformer from scratch, especially on small datasets, usually fails to converge. Our dedicated probing tasks show that 1) both absolute and relative position information is gradually weakened or even vanishes in the upper encoder layers, and 2) the vanishing of absolute position information in the encoder output causes the training failure of Doc2Doc NMT. To alleviate this problem, we propose a position-aware Transformer (P-Transformer) that enhances both absolute and relative position information in self-attention and cross-attention. Specifically, we integrate absolute position information, i.e., position embeddings, into the query-key pairs in both self-attention and cross-attention through a simple yet effective addition operation. Moreover, we integrate relative position encoding in self-attention. P-Transformer uses sinusoidal position encoding and requires no task-specific position embedding, segment embedding, or attention mechanism. With P-Transformer, we build a Doc2Doc NMT model that ingests the source document and generates the complete target document in a sequence-to-sequence (seq2seq) manner. In addition, P-Transformer can be applied to seq2seq-based document-to-sentence (Doc2Sent) and sentence-to-sentence (Sent2Sent) translation. Extensive experiments on Doc2Doc NMT show that P-Transformer significantly outperforms strong baselines on 9 widely-used document-level datasets across 7 language pairs, covering small, middle, and large scales, and achieves a new state of the art. Experiments on discourse phenomena show that our Doc2Doc NMT models improve translation quality in both BLEU and discourse coherence. We make our code available on GitHub.
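To make the addition operation concrete, the following minimal PyTorch sketch adds sinusoidal position embeddings to the query-key pairs before the scaled dot product and, for self-attention, adds a relative-position bias to the attention scores. The function names, tensor layout, and the exact form of the relative term are illustrative assumptions, not the released implementation.

```python
import math
import torch

def sinusoidal_encoding(length, d_model):
    """Standard sinusoidal position encoding (assumes even d_model)."""
    pos = torch.arange(length, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(length, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

def position_aware_attention(q, k, v, q_pos, k_pos, rel_bias=None):
    """Attention with absolute position embeddings added to the
    query-key pairs; rel_bias optionally injects relative position
    information (self-attention only in this sketch)."""
    d = q.size(-1)
    scores = (q + q_pos) @ (k + k_pos).transpose(-2, -1) / math.sqrt(d)
    if rel_bias is not None:
        scores = scores + rel_bias  # hypothetical relative-position term
    return torch.softmax(scores, dim=-1) @ v

# Toy usage: single-head self-attention over a 6-token sequence.
L, d = 6, 16
x = torch.randn(1, L, d)
pe = sinusoidal_encoding(L, d)
rel = torch.zeros(L, L)  # placeholder relative-position bias
out = position_aware_attention(x, x, x, pe, pe, rel)  # shape (1, L, d)
```

For cross-attention, the same pattern would add the decoder-side encoding to the queries and the encoder-side encoding to the keys, with the relative term omitted.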