Large language models (LLMs) such as ChatGPT can produce coherent, cohesive, relevant, and fluent answers for various natural language processing (NLP) tasks. Taking document-level machine translation (MT) as a testbed, this paper provides an in-depth evaluation of LLMs' ability in discourse modeling. The study focuses on three aspects: 1) Effects of Discourse-Aware Prompts, where we investigate the impact of different prompts on document-level translation quality and discourse phenomena; 2) Comparison of Translation Models, where we compare the translation performance of ChatGPT with commercial MT systems and advanced document-level MT methods; 3) Analysis of Discourse Modeling Abilities, where we further probe discourse knowledge encoded in LLMs and examine the impact of training techniques on discourse modeling. By evaluating on a number of benchmarks, we surprisingly find that 1) leveraging its powerful long-text modeling capabilities, ChatGPT outperforms commercial MT systems in terms of human evaluation; 2) GPT-4 demonstrates a strong ability to explain discourse knowledge, even though it may select incorrect translation candidates in contrastive testing; 3) ChatGPT and GPT-4 have demonstrated superior performance and show potential to become a new and promising paradigm for document-level translation. This work highlights the challenges and opportunities of discourse modeling for LLMs, which we hope can inspire the future design and evaluation of LLMs.