Preference-learning methods for machine translation (MT), such as Direct Preference Optimization (DPO), have shown strong gains but typically rely on large, carefully curated preference triplets and often struggle to generalize beyond their tuning domains. We propose Reinforcement Learning from Teacher-Model Refinement (RLfR), which replaces static triplets with on-policy, actor-conditioned refinements produced by a frozen teacher. At each step, the actor samples candidate translations, the teacher performs a minimal local edit of each draft, and the actor is reinforced to close the gap using a composite reward that combines a scaled negative edit distance for lexical and structural fidelity with COMET for semantic adequacy. This formulation yields a stable, model-aware learning signal without requiring explicit preference datasets. Experiments on FLORES-200 (English to German, Spanish, Chinese, Korean, and Japanese) show that RLfR consistently outperforms strong MT-SFT, DPO, and fixed-reference RL baselines, improving semantic quality and entity preservation, and also achieving superior performance under LLM-based judge evaluations.
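
As a rough illustration of the composite reward described in the abstract, the Python sketch below combines a length-scaled negative edit distance between the actor's draft and the teacher's refinement with a COMET-style adequacy score. The mixing weight `alpha`, the length-based scaling, the `comet_score` stub, and the use of the teacher refinement as the COMET reference are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of a composite reward in the spirit of RLfR.
# All weights, scaling choices, and the COMET interface are assumptions.

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance over characters (single-row dynamic programming)."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # deletion
                        dp[j - 1] + 1,                   # insertion
                        prev + (a[i - 1] != b[j - 1]))   # substitution
            prev = cur
    return dp[n]


def comet_score(source: str, hypothesis: str, reference: str) -> float:
    """Placeholder for a COMET model call (e.g. via Unbabel's `comet` package);
    returns a semantic-adequacy score, typically in roughly [0, 1]."""
    raise NotImplementedError


def composite_reward(source: str, draft: str, refinement: str,
                     alpha: float = 0.5) -> float:
    """Hypothetical reward: alpha * scaled negative edit distance between the
    actor's draft and the teacher's refinement, plus (1 - alpha) * COMET
    adequacy of the draft, using the refinement as the reference."""
    # Scale by refinement length so the edit-distance term stays in [-1, 0].
    scaled_neg_edit = -edit_distance(draft, refinement) / max(len(refinement), 1)
    semantic = comet_score(source, draft, refinement)
    return alpha * scaled_neg_edit + (1 - alpha) * semantic
```

In an on-policy loop, the actor would sample several drafts per source sentence, query the frozen teacher for a minimal edit of each, score every draft with a reward of this form, and update the policy with a standard policy-gradient objective; those training details are not shown here.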


