We study the power of cross-attention in the Transformer architecture within the context of machine translation. In transfer learning experiments, where we fine-tune a translation model on a dataset with one new language, we find that, apart from the new language's embeddings, only the cross-attention parameters need to be fine-tuned to obtain competitive BLEU performance. We provide insights into why this is the case and further find that limiting fine-tuning in this manner yields cross-lingually aligned type embeddings. The implications of this finding include a mitigation of catastrophic forgetting in the network and the potential for zero-shot translation.
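To make the fine-tuning scheme concrete, the sketch below shows one way to freeze a pretrained encoder–decoder model except for the cross-attention sub-layers and the new language's embeddings. This is a minimal illustration, not the authors' code: the module layout follows PyTorch's `nn.Transformer`, where cross-attention in each decoder layer is named `multihead_attn`, and the embedding tables here are hypothetical stand-ins.

```python
import torch
import torch.nn as nn

d_model, vocab_new_src, vocab_tgt = 512, 32000, 32000

# Hypothetical translation model: the source side uses the new language,
# so its embedding table is freshly initialised; everything else is "pretrained".
src_embed = nn.Embedding(vocab_new_src, d_model)   # new-language embeddings
tgt_embed = nn.Embedding(vocab_tgt, d_model)       # pretrained, stays frozen
transformer = nn.Transformer(d_model=d_model, batch_first=True)

# 1) Freeze every parameter in the model.
for module in (src_embed, tgt_embed, transformer):
    for p in module.parameters():
        p.requires_grad = False

# 2) Unfreeze the new language's embeddings.
for p in src_embed.parameters():
    p.requires_grad = True

# 3) Unfreeze only the cross-attention sub-layers of the decoder
#    (nn.TransformerDecoderLayer exposes them as `multihead_attn`).
for layer in transformer.decoder.layers:
    for p in layer.multihead_attn.parameters():
        p.requires_grad = True

# Hand only the trainable subset to the optimiser.
trainable = [p for m in (src_embed, tgt_embed, transformer)
             for p in m.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=5e-4)
```

In this setup the gradient updates touch only the cross-attention weights and the new-language embedding table, which is what leaves the rest of the pretrained network intact and helps mitigate catastrophic forgetting.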