In this paper, we present our submissions to the unified WMT25 Translation Evaluation Shared Task. For the Quality Score Prediction subtask, we create a new generation of MetricX with improvements in the input format and the training protocol, while for the Error Span Detection subtask we develop a new model, GemSpanEval, trained to predict error spans along with their severities and categories. Both systems are based on the state-of-the-art multilingual open-weights model Gemma 3, fine-tuned on publicly available WMT data. We demonstrate that MetricX-25, which adapts Gemma 3 to an encoder-only architecture with a regression head on top, can be trained to effectively predict both MQM and ESA quality scores, and that it significantly outperforms its predecessor. We also show that our decoder-only GemSpanEval model is competitive in error span detection with xCOMET, a strong encoder-only sequence-tagging baseline. Because error span detection is formulated as a generative task, we instruct the model to also output the context for each predicted error span, thus ensuring that error spans are identified unambiguously.
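To illustrate why emitting a context alongside each error span removes ambiguity, the following is a minimal sketch of how a generated span could be mapped back to character offsets in the translation. It is not the GemSpanEval implementation: the function names `find_all` and `locate_span`, and the rule that both the context and the span must match exactly once, are assumptions made here for illustration only.

```python
from typing import List, Optional, Tuple


def find_all(haystack: str, needle: str) -> List[int]:
    """Return the start offsets of every (possibly overlapping) match."""
    starts = []
    i = haystack.find(needle)
    while i != -1:
        starts.append(i)
        i = haystack.find(needle, i + 1)
    return starts


def locate_span(text: str, span: str, context: str) -> Optional[Tuple[int, int]]:
    """Resolve a predicted error span to (start, end) offsets in `text`.

    Hypothetical resolution rule (an assumption, not taken from the paper):
    the context must occur exactly once in the text, and the span must
    occur exactly once inside the context. `end` is exclusive. Returns
    None when the span cannot be placed uniquely.
    """
    if not span or not context:
        return None
    context_starts = find_all(text, context)
    span_starts = find_all(context, span)
    if len(context_starts) != 1 or len(span_starts) != 1:
        return None
    start = context_starts[0] + span_starts[0]
    return start, start + len(span)


translation = "the cat sat on the mat"

# The bare span "the" matches twice, so it cannot be placed on its own.
ambiguous = locate_span(translation, "the", "the")

# Adding surrounding context pins it to the second occurrence.
resolved = locate_span(translation, "the", "on the mat")
```

In this sketch, a span whose text repeats in the translation is rejected unless its context singles out one occurrence, which is the property a sequence-tagging model obtains directly from token positions.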