We introduce a new distributed policy gradient algorithm and show that it outperforms existing reward-aware training procedures such as REINFORCE, minimum risk training (MRT) and proximal policy optimization (PPO) in terms of training stability and generalization performance when optimizing machine translation models. Our algorithm, which we call MAD (on account of using the mean absolute deviation in the importance weighting calculation), has distributed data generators sampling multiple candidates per source sentence on worker nodes, while a central learner updates the policy. MAD depends crucially on two variance reduction strategies: (1) a conditional reward normalization method that ensures each source sentence has both positive and negative reward translation examples and (2) a new robust importance weighting scheme that acts as a conditional entropy regularizer. Experiments on a variety of translation tasks show that policies learned using the MAD algorithm perform very well when using both greedy decoding and beam search, and that the learned policies are sensitive to the specific reward used during training.
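As a rough illustration of the two variance reduction strategies named above, the sketch below shows (1) per-source reward standardization, which centers each candidate group so it contains both positive and negative rewards, and (2) a robust, MAD-based down-weighting of outlier candidates. The function names, the exponential weighting form, and the treatment of rewards as scalar metric scores are assumptions made for illustration only, not the paper's exact formulation.

```python
import numpy as np


def normalize_rewards_per_source(rewards):
    """Conditional (per-source) reward normalization: center and scale the
    rewards of the candidates sampled for ONE source sentence, so the group
    contains both positive and negative values.

    NOTE: plain standardization is an assumption for illustration; the exact
    normalization used in the paper may differ.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    centered = rewards - rewards.mean()
    return centered / (rewards.std() + 1e-8)


def mad_importance_weights(log_p_learner, log_p_behavior):
    """Robust importance weighting based on the mean absolute deviation (MAD)
    of per-candidate log importance ratios: candidates whose ratio lies far
    from the median are down-weighted, damping high-variance updates.

    NOTE: the exponential form below is a hypothetical instantiation chosen
    for illustration; only the use of MAD in the weighting is taken from the
    abstract.
    """
    log_ratio = np.asarray(log_p_learner) - np.asarray(log_p_behavior)
    deviation = np.abs(log_ratio - np.median(log_ratio))
    mad = np.mean(deviation)
    weights = np.exp(-deviation / (mad + 1e-8))
    return weights / weights.sum()


# Example: five sampled candidates for one source sentence.
raw_rewards = [0.31, 0.35, 0.28, 0.40, 0.33]          # e.g. sentence-level metric scores
log_p_learner = [-12.1, -10.4, -13.0, -9.8, -11.2]    # log-probs under the current policy
log_p_behavior = [-11.8, -10.9, -12.5, -10.1, -11.0]  # log-probs under the data-generating policy

r_tilde = normalize_rewards_per_source(raw_rewards)
w = mad_importance_weights(log_p_learner, log_p_behavior)
print(r_tilde, w)
```

In this sketch the central learner would combine `r_tilde` and `w` into a weighted policy gradient update over the candidates returned by the worker nodes; the details of that update are left out here.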