Can pre-trained BERT for one language and GPT for another be glued together to translate texts? Self-supervised training using only monolingual data has led to the success of pre-trained (masked) language models in many NLP tasks. However, directly connecting BERT as an encoder and GPT as a decoder is challenging in machine translation, because GPT-like models lack the cross-attention component that seq2seq decoders require. In this paper, we propose Graformer to graft separately pre-trained (masked) language models for machine translation. With monolingual data for pre-training and parallel data for grafting training, we take maximal advantage of both types of data. Experiments on 60 directions show that our method achieves average improvements of 5.8 BLEU in x2en and 2.9 BLEU in en2x directions compared with a multilingual Transformer of the same size.
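To make the cross-attention gap concrete, the following is a minimal PyTorch sketch of the grafting idea: a pre-trained decoder-only (GPT-style) block is wrapped with a newly initialized cross-attention sub-layer that attends to the encoder (BERT-style) states. The class names, the stand-in `ToyGPTBlock`, and the single residual/layer-norm arrangement are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn

class GraftedDecoderLayer(nn.Module):
    """Wrap a pre-trained decoder-only block and add a freshly initialized
    cross-attention sub-layer so it can attend to encoder states.
    (Hypothetical sketch; not the authors' released code.)"""

    def __init__(self, gpt_block: nn.Module, d_model: int, n_heads: int):
        super().__init__()
        self.gpt_block = gpt_block                       # pre-trained self-attn + FFN
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, tgt, memory, tgt_mask=None):
        h = self.gpt_block(tgt, tgt_mask)                # original GPT computation
        attn, _ = self.cross_attn(h, memory, memory)     # new: attend to encoder output
        return self.norm(h + attn)                       # residual + layer norm


class ToyGPTBlock(nn.Module):
    """Stand-in for one pre-trained GPT block: self-attention with a causal mask."""
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, x, attn_mask=None):
        return self.layer(x, src_mask=attn_mask)


d_model, n_heads = 64, 4
layer = GraftedDecoderLayer(ToyGPTBlock(d_model, n_heads), d_model, n_heads)
tgt = torch.randn(2, 7, d_model)      # decoder states: (batch, tgt_len, d_model)
memory = torch.randn(2, 9, d_model)   # encoder states: (batch, src_len, d_model)
causal = torch.triu(torch.full((7, 7), float("-inf")), diagonal=1)
print(layer(tgt, memory, causal).shape)   # torch.Size([2, 7, 64])
```

In this sketch only the cross-attention and layer norm are new parameters; the pre-trained block is reused as-is, which mirrors the abstract's point that monolingual pre-training and parallel grafting training play separate roles.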