There has been recent success in pre-training on monolingual data and fine-tuning on Machine Translation (MT), but it remains unclear how to best leverage a pre-trained model for a given MT task. This paper investigates the benefits and drawbacks of freezing parameters, and adding new ones, when fine-tuning a pre-trained model on MT. We focus on 1) fine-tuning a model trained only on English monolingual data, BART, and 2) fine-tuning a model trained on monolingual data from 25 languages, mBART. For BART we get the best performance by freezing most of the model parameters and adding extra positional embeddings. For mBART we match or exceed the performance of naive fine-tuning for most language pairs with the encoder, and most of the decoder, frozen. The encoder-decoder attention parameters are the most important to fine-tune. When constraining ourselves to an out-of-domain training set for Vietnamese-to-English, we see the largest improvements over the fine-tuning baseline.
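To make the mBART recipe concrete, the snippet below is a minimal sketch (not the authors' released code) of freezing the encoder and most of the decoder while leaving the encoder-decoder attention trainable. It assumes the Hugging Face transformers MBartForConditionalGeneration implementation, where cross-attention parameters are named with encoder_attn; the checkpoint name and selection rule are illustrative assumptions.

```python
# Sketch: freeze everything except encoder-decoder (cross-)attention,
# the block the abstract identifies as most important to fine-tune.
# Assumes Hugging Face `transformers` naming, where decoder cross-attention
# modules are called `encoder_attn`; adapt the filter to your own model.
from transformers import MBartForConditionalGeneration

model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-cc25")

for name, param in model.named_parameters():
    # Trainable only if the parameter belongs to a cross-attention module.
    param.requires_grad = "encoder_attn" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Fine-tuning {trainable / total:.1%} of parameters")
```

The analogous BART setup would instead freeze most pre-trained weights and add new (trainable) positional embeddings for the source language; the exact set of unfrozen parameters is a design choice explored in the paper rather than fixed by this sketch.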