MetricX-25 and GemSpanEval: Google Translate Submissions to the WMT25 Evaluation Shared Task

In this paper, we present our submissions to the unified WMT25 Translation Evaluation Shared Task. For the Quality Score Prediction subtask, we create a new generation of MetricX with improvements in the input format and the training protocol. For the Error Span Detection subtask, we develop a new model, GemSpanEval, trained to predict error spans along with their severities and categories. Both systems are based on Gemma 3, a state-of-the-art multilingual open-weights model, fine-tuned on publicly available WMT data. We demonstrate that MetricX-25, which adapts Gemma 3 to an encoder-only architecture with a regression head on top, can be trained to effectively predict both MQM and ESA quality scores, and that it significantly outperforms its predecessor. We also show that our decoder-only GemSpanEval model is competitive in error span detection with xCOMET, a strong encoder-only sequence-tagging baseline. With error span detection formulated as a generative task, we instruct the model to also output the context for each predicted error span, thus ensuring that error spans are identified unambiguously.
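To illustrate why emitting a context string alongside each span can remove ambiguity, here is a minimal Python sketch. It is not the GemSpanEval implementation: the abstract does not specify the model's output schema, so the function name `locate_span` and the idea that the model returns a `span` string plus a longer `context` string containing it are assumptions made purely for illustration.

```python
from typing import Optional, Tuple


def locate_span(translation: str, span: str, context: str) -> Optional[Tuple[int, int]]:
    """Map a predicted error span back to (start, end) character offsets.

    Hypothetical helper: assumes a generative model emitted the error text
    (`span`) and a longer surrounding substring (`context`).
    Returns None if the span cannot be located unambiguously.
    """
    if not span:
        return None
    # Case 1: the span text occurs exactly once, so no context is needed.
    # (str.count counts non-overlapping occurrences.)
    if translation.count(span) == 1:
        start = translation.find(span)
        return start, start + len(span)
    # Case 2: the span text is repeated or absent; use the context,
    # which must itself be unique and contain the span exactly once.
    if not context or translation.count(context) != 1:
        return None
    if context.count(span) != 1:
        return None
    start = translation.find(context) + context.find(span)
    return start, start + len(span)


# "the" appears twice in lowercase, so the span alone is ambiguous.
translation = "The cat sat on the mat near the door."
print(locate_span(translation, "the", "near the door"))  # (28, 31)
```

A sequence-tagging model labels token positions directly and never faces this problem. A generative model that emits only the error text does, which is the motivation the abstract gives for also requesting the context.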
