Large language models (LLMs) are competitive with the state of the art on a wide range of sentence-level translation datasets. However, their ability to translate paragraphs and documents remains unexplored because evaluation in these settings is costly and difficult. We show through a rigorous human evaluation that asking the GPT-3.5 (text-davinci-003) LLM to translate an entire literary paragraph (e.g., from a novel) at once results in higher-quality translations than standard sentence-by-sentence translation across 18 linguistically diverse language pairs (e.g., translating into and out of Japanese, Polish, and English). Our evaluation, which took approximately 350 hours of effort for annotation and analysis, is conducted by hiring translators fluent in both the source and target language and asking them to provide both span-level error annotations and preference judgments of which system's translations are better. We observe that discourse-level LLM translators commit fewer mistranslations, grammar errors, and stylistic inconsistencies than sentence-level approaches. That said, critical errors still abound, including occasional content omissions, and a human translator's intervention remains necessary to ensure that the author's voice remains intact. We publicly release our dataset and error annotations to spur future research on the evaluation of document-level literary translation.
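The two conditions compared above can be sketched as prompt-construction code. This is a hypothetical illustration, not the paper's actual prompts: the prompt wording, function names, and the naive regex-based sentence splitter are all assumptions made for clarity.

```python
import re


def paragraph_prompt(paragraph: str, src: str, tgt: str) -> str:
    # Discourse-level setup: one prompt carries the whole paragraph,
    # so the model sees cross-sentence context (pronouns, register,
    # discourse cues) when translating.
    return (f"Translate the following {src} paragraph into {tgt}, "
            f"preserving the author's style:\n\n{paragraph}")


def sentence_prompts(paragraph: str, src: str, tgt: str) -> list[str]:
    # Sentence-level baseline: split on sentence-final punctuation and
    # translate each sentence in isolation, with no access to neighbors.
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", paragraph)
                 if s.strip()]
    return [f"Translate the following {src} sentence into {tgt}:\n\n{s}"
            for s in sentences]


if __name__ == "__main__":
    para = "She paused. The rain had stopped."
    print(paragraph_prompt(para, "English", "Japanese"))
    for p in sentence_prompts(para, "English", "Japanese"):
        print(p)
```

The sentence-level baseline issues one model call per sentence, which is exactly where discourse phenomena such as pronoun reference and stylistic consistency are lost.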