We describe GEMBA, a GPT-based metric for assessment of translation quality, which works both with and without a reference translation. In our evaluation, we focus on zero-shot prompting, comparing four prompt variants in two modes, based on the availability of the reference. We investigate seven versions of GPT models, including ChatGPT. We show that our method for translation quality assessment only works with GPT 3.5 and larger models. Compared to results from the WMT22 Metrics shared task, our method achieves state-of-the-art accuracy in both modes when compared to MQM-based human labels. Our results are valid on the system level for all three WMT22 Metrics shared task language pairs, namely English into German, English into Russian, and Chinese into English. This provides a first glimpse into the usefulness of pre-trained, generative large language models for quality assessment of translations. We publicly release all our code and prompt templates used for the experiments described in this work, as well as all corresponding scoring results, to allow for external validation and reproducibility.