Learned metrics such as BLEURT have in recent years become widely employed to evaluate the quality of machine translation systems. Training such metrics requires data that can be expensive and difficult to acquire, particularly for lower-resource languages. We show how knowledge can be distilled from Large Language Models (LLMs) to improve such learned metrics without requiring human annotators, by creating synthetic datasets that can be mixed into existing ones, requiring only a corpus of text in the target language. We show that the performance of a BLEURT-like model on lower-resource languages can be improved in this way.