Large language models (LLMs) have demonstrated remarkable potential in handling multilingual machine translation (MMT). In this paper, we systematically investigate the advantages and challenges of LLMs for MMT by answering two questions: 1) How well do LLMs perform in translating a massive number of languages? 2) Which factors affect LLMs' performance in translation? We evaluate popular LLMs, including XGLM, OPT, BLOOMZ, and ChatGPT, on 102 languages. Our empirical results show that even the best-performing model, ChatGPT, still lags behind the supervised baseline NLLB in 83.33% of translation directions. Through further analysis, we discover that LLMs exhibit new working patterns when used for MMT. First, prompt semantics can surprisingly be ignored when in-context exemplars are given: LLMs still show strong performance even with unreasonable prompts. Second, cross-lingual exemplars can provide better task instruction for low-resource translation than exemplars in the same language pair. Third, we find that the performance of BLOOMZ on the Flores-101 dataset is overestimated, indicating a potential risk when using public datasets for evaluation.
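To make the exemplar-based prompting setup concrete, the minimal sketch below (not the paper's code; the build_prompt helper, the exemplar sentences, and the language choices are illustrative assumptions) shows how a few-shot translation prompt can be assembled from exemplars that either match the test language pair or come from a different, cross-lingual pair.

# Illustrative sketch of few-shot translation prompting with in-context exemplars.
# Exemplar sentences and language choices are hypothetical placeholders.

def build_prompt(exemplars, src_lang, tgt_lang, source_sentence):
    """Concatenate few-shot exemplars followed by the test source sentence.

    Each exemplar is a (src_text, tgt_text, src_lang, tgt_lang) tuple, so the
    exemplars may use the same language pair as the test input or a different
    (cross-lingual) pair.
    """
    lines = []
    for ex_src, ex_tgt, ex_src_lang, ex_tgt_lang in exemplars:
        lines.append(f"{ex_src_lang}: {ex_src}\n{ex_tgt_lang}: {ex_tgt}")
    # The unfinished target line prompts the model to produce the translation.
    lines.append(f"{src_lang}: {source_sentence}\n{tgt_lang}:")
    return "\n\n".join(lines)

# Same-language-pair exemplar (English -> Icelandic, a low-resource direction):
same_pair = [("Good morning.", "Góðan daginn.", "English", "Icelandic")]

# Cross-lingual exemplar (English -> German) used to instruct an
# English -> Icelandic test instance:
cross_lingual = [("Good morning.", "Guten Morgen.", "English", "German")]

print(build_prompt(cross_lingual, "English", "Icelandic", "How are you?"))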