Large language models (LLMs) are competitive with the state of the art on a wide range of sentence-level translation datasets. However, their ability to translate paragraphs and documents remains unexplored because evaluation in these settings is costly and difficult. We show through a rigorous human evaluation that asking the GPT-3.5 (text-davinci-003) LLM to translate an entire literary paragraph (e.g., from a novel) at once results in higher-quality translations than standard sentence-by-sentence translation across 18 linguistically diverse language pairs (e.g., translating into and out of Japanese, Polish, and English). Our evaluation, which took approximately 350 hours of annotation and analysis effort, was conducted by hiring translators fluent in both the source and target languages and asking them to provide both span-level error annotations and preference judgments of which system's translations are better. We observe that discourse-level LLM translators commit fewer mistranslations, grammar errors, and stylistic inconsistencies than sentence-level approaches. That said, critical errors still abound, including occasional content omissions, and a human translator's intervention remains necessary to ensure that the author's voice remains intact. We publicly release our dataset and error annotations to spur future research on the evaluation of document-level literary translation.
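The contrast between sentence-by-sentence and paragraph-level translation can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the prompt wording in `build_prompt`, the naive `split_sentences` helper, and the `llm` callable (any function that maps a prompt string to a completion string) are hypothetical and are not the templates or tooling used in the paper.

```python
import re
from typing import Callable, List


def split_sentences(paragraph: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace.

    Illustrative only; it will not segment languages such as Japanese,
    which use different sentence-final punctuation.
    """
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]


def build_prompt(text: str, src: str, tgt: str) -> str:
    # Hypothetical prompt wording, not the template from the paper.
    return (
        f"Translate the following {src} text into {tgt}.\n\n"
        f"{text}\n\nTranslation:"
    )


def translate_sentence_level(
    paragraph: str, src: str, tgt: str, llm: Callable[[str], str]
) -> str:
    """One LLM call per sentence; each sentence is translated without context."""
    outputs = [
        llm(build_prompt(sentence, src, tgt)).strip()
        for sentence in split_sentences(paragraph)
    ]
    return " ".join(outputs)


def translate_paragraph_level(
    paragraph: str, src: str, tgt: str, llm: Callable[[str], str]
) -> str:
    """A single LLM call that sees the whole paragraph at once."""
    return llm(build_prompt(paragraph, src, tgt)).strip()
```

In the sentence-level variant the model never sees neighboring sentences, so it has no information for resolving pronouns or keeping style consistent across sentence boundaries; the paragraph-level variant passes that context in a single prompt.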