Controlled text generation tasks such as unsupervised text style transfer have increasingly adopted Reinforcement Learning (RL). A major challenge in applying RL to such tasks is the sparse reward, which is available only after the full text is generated. Sparse rewards, combined with a large action space, make RL training sample-inefficient and difficult to converge. Recently proposed reward-shaping strategies to address this issue have shown only negligible gains. In contrast, this work proposes a novel approach that provides dense rewards for each generated token. We evaluate our approach by applying it to unsupervised text style transfer. Averaged across datasets, our style transfer system improves upon current state-of-the-art systems by 21\% on human evaluation and 12\% on automatic evaluation. In an ablated comparison with the current reward-shaping approach (the `roll-out strategy'), using dense rewards improves overall style transfer quality by 22\% based on human evaluation. Further, RL training is 2.5 times as sample-efficient and 7 times faster.
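To make the contrast concrete, the following is a minimal conceptual sketch (not the paper's implementation) of the difference between a sparse reward, assigned only once the full sequence has been generated, and a dense scheme that assigns a reward to every generated token. The scoring functions `score_sequence` and `score_token` are illustrative placeholders, not components described in the paper.

```python
from typing import Callable, List

def sparse_rewards(tokens: List[str],
                   score_sequence: Callable[[List[str]], float]) -> List[float]:
    """Sparse setting: only the final step receives the sequence-level reward."""
    rewards = [0.0] * len(tokens)
    rewards[-1] = score_sequence(tokens)  # e.g. a style-classifier score on the full text
    return rewards

def dense_rewards(tokens: List[str],
                  score_token: Callable[[List[str], int], float]) -> List[float]:
    """Dense setting: every generated token receives its own reward signal."""
    return [score_token(tokens, i) for i in range(len(tokens))]

if __name__ == "__main__":
    toks = ["the", "food", "was", "great"]
    # Toy scoring functions for illustration only.
    seq_score = lambda ts: 1.0 if "great" in ts else 0.0
    tok_score = lambda ts, i: 1.0 if ts[i] == "great" else 0.1
    print(sparse_rewards(toks, seq_score))  # [0.0, 0.0, 0.0, 1.0]
    print(dense_rewards(toks, tok_score))   # [0.1, 0.1, 0.1, 1.0]
```

With a per-token signal, every action in the generation trajectory receives direct feedback, which is the motivation for the sample-efficiency gains reported above; the actual token-level reward design is detailed in the body of the paper.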