Reinforcement learning (RL) has been widely used in text generation to alleviate the exposure bias issue or to utilize non-parallel datasets. The reward function plays a key role in the success of RL training. However, previous reward functions are typically task-specific and sparse, restricting the use of RL. In our work, we propose a task-agnostic approach that derives a step-wise reward function directly from a model trained with teacher forcing. We additionally propose a simple modification that stabilizes RL training with our induced reward function on non-parallel datasets. Empirical results show that our method outperforms self-training and reward regression methods on several text generation tasks, confirming the effectiveness of our reward function.
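To make the idea of a step-wise reward induced from a teacher-forced model concrete, the following is a minimal sketch of one plausible instantiation, not the paper's exact formulation: a frozen MLE-trained (teacher-forced) model scores each sampled token with its log-probability, and those per-token scores are used as step-wise rewards in a REINFORCE-style update. The function names are hypothetical, and the code assumes a HuggingFace-style causal language model whose forward pass returns `.logits`.

```python
import torch
import torch.nn.functional as F

def stepwise_rewards(mle_model, input_ids, generated_ids):
    """Hypothetical sketch: score each sampled token with the log-probability
    assigned by a frozen model trained with teacher forcing (MLE) and use the
    per-token scores as step-wise rewards. Assumes a causal LM whose forward
    pass returns `.logits` of shape (batch, seq_len, vocab)."""
    with torch.no_grad():
        # Condition the frozen MLE model on the source plus the sampled continuation.
        full = torch.cat([input_ids, generated_ids], dim=1)
        logits = mle_model(full).logits
        # Positions input_len-1 .. end-2 are the ones that predict the generated tokens.
        gen_logits = logits[:, input_ids.size(1) - 1 : -1, :]
        log_probs = F.log_softmax(gen_logits, dim=-1)
        # Step-wise reward: log-prob the MLE model assigns to each sampled token.
        rewards = log_probs.gather(-1, generated_ids.unsqueeze(-1)).squeeze(-1)
    return rewards  # (batch, gen_len): one reward per decoding step

def reinforce_loss(policy_log_probs, rewards):
    """REINFORCE-style loss using per-step returns-to-go (no baseline)."""
    returns = torch.flip(torch.cumsum(torch.flip(rewards, dims=[1]), dim=1), dims=[1])
    return -(policy_log_probs * returns.detach()).mean()
```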