Maximum likelihood estimation (MLE) is the predominant algorithm for training text generation models. This paradigm relies on direct supervision from examples, which is not available in many applications, such as generating adversarial attacks or generating prompts to control language models. Reinforcement learning (RL), on the other hand, offers a more flexible solution by allowing users to plug in arbitrary task metrics as the reward. Yet previous RL algorithms for text generation, such as policy gradient (on-policy RL) and Q-learning (off-policy RL), are often notoriously inefficient or unstable to train due to the large sequence space and the sparse reward received only at the end of sequences. In this paper, we introduce a new RL formulation for text generation from the soft Q-learning perspective. This formulation further enables us to draw on the latest RL advances, such as path consistency learning, to combine the best of on-/off-policy updates and learn effectively from sparse reward. We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation. Experiments show our approach consistently outperforms both task-specialized algorithms and previous RL methods. On standard supervised tasks where MLE prevails, our approach also achieves competitive performance and stability when training text generation models from scratch.
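For context on the terms above, here is a minimal background sketch of the standard soft Q-learning and path consistency relations from the RL literature; the notation ($Q^*$, $V^*$, $\pi^*$, reward $r$, discount $\gamma$, temperature $\tau$) is generic and assumed here rather than taken from this abstract, and the paper's exact text-generation formulation may differ. In maximum-entropy (soft) RL, the optimal policy and soft value function satisfy

\[
\pi^*(a \mid s) = \exp\!\big(\,(Q^*(s,a) - V^*(s))/\tau\,\big),
\qquad
V^*(s) = \tau \log \sum_{a} \exp\!\big(Q^*(s,a)/\tau\big),
\]

and path consistency learning exploits the single-step consistency condition

\[
V^*(s_t) - \gamma\, V^*(s_{t+1}) = r(s_t, a_t) - \tau \log \pi^*(a_t \mid s_t),
\]

which holds for transitions from any behavior policy; minimizing its squared residual therefore admits both on-policy and off-policy updates, even when (as in text generation, with states as partial token sequences and actions as vocabulary tokens) $r$ is nonzero only at the end of a sequence.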