Recent developments in machine translation and multilingual text generation have led researchers to adopt trained metrics such as COMET or BLEURT, which treat evaluation as a regression problem and use representations from multilingual pre-trained models such as XLM-RoBERTa or mBERT. Yet studies on related tasks suggest that these models are most effective when they are large, which is costly and impractical for evaluation. We investigate the trade-off between multilinguality and model capacity with RemBERT, a state-of-the-art multilingual language model, using data from the WMT Metrics Shared Task. We present a series of experiments showing that model size is indeed a bottleneck for cross-lingual transfer, then demonstrate how distillation can help address this bottleneck by leveraging synthetic data generation and by transferring knowledge from one teacher to multiple students trained on related languages. Our method yields up to 10.5% improvement over vanilla fine-tuning and reaches 92.6% of RemBERT's performance using only a third of its parameters.
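To make the teacher-student setup concrete, the following is a minimal sketch of distilling a large multilingual regression metric into a smaller student, assuming the RemBERT teacher has already been fine-tuned on WMT human judgments. The model names, the randomly initialized regression heads, and the `synthetic_pairs` placeholder are illustrative assumptions, not the paper's released code.

```python
# Hedged sketch: knowledge distillation of a trained MT-evaluation metric.
# The teacher scores synthetic (hypothesis, reference) pairs and the student
# is trained to regress onto those scores.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class MetricRegressor(nn.Module):
    """Encoder with a linear head mapping a (hypothesis, reference) pair to a score."""
    def __init__(self, encoder_name):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, **enc):
        pooled = self.encoder(**enc).last_hidden_state[:, 0]  # first-token pooling
        return self.head(pooled).squeeze(-1)

# Large multilingual teacher and a smaller student; in practice the teacher
# would be restored from a fine-tuned metric checkpoint rather than built here.
teacher = MetricRegressor("google/rembert").eval()
student = MetricRegressor("xlm-roberta-base")
t_tok = AutoTokenizer.from_pretrained("google/rembert")
s_tok = AutoTokenizer.from_pretrained("xlm-roberta-base")

opt = torch.optim.AdamW(student.parameters(), lr=1e-5)
mse = nn.MSELoss()

# Placeholder for the synthetic data-generation step; the paper trains one
# student per group of related languages, while this loop shows a single
# student for brevity.
synthetic_pairs = [("Das ist ein Test.", "This is a test.")]

for hyp, ref in synthetic_pairs:
    with torch.no_grad():
        t_enc = t_tok(hyp, ref, return_tensors="pt", truncation=True)
        target = teacher(**t_enc)          # teacher score serves as the label
    s_enc = s_tok(hyp, ref, return_tensors="pt", truncation=True)
    loss = mse(student(**s_enc), target)   # student regresses to the teacher
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the student only needs the teacher's scalar outputs, the synthetic pairs require no human annotation, which is what makes per-language-group students affordable.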