Current approaches to text generation largely rely on autoregressive models and maximum likelihood estimation. This paradigm leads to (i) diverse but low-quality samples due to mismatched learning objective and evaluation metric (likelihood vs. quality) and (ii) exposure bias due to mismatched history distributions (gold vs. model-generated). To alleviate these problems, we frame text generation as an offline reinforcement learning (RL) problem with expert demonstrations (i.e., the reference), where the goal is to maximize quality given model-generated histories. We propose GOLD (generation by off-policy learning from demonstrations): an easy-to-optimize algorithm that learns from the demonstrations by importance weighting. Intuitively, GOLD upweights confident tokens and downweights unconfident ones in the reference during training, avoiding optimization issues faced by prior RL approaches that rely on online data collection. According to both automatic and human evaluation, models trained by GOLD outperform those trained by MLE and policy gradient on summarization, question generation, and machine translation. Further, our models are less sensitive to decoding algorithms and alleviate exposure bias.
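To make the importance-weighting idea concrete, below is a minimal sketch (not the authors' released code) of a GOLD-style training loss on the reference tokens, assuming a standard PyTorch autoregressive model that returns per-position logits. The function name `gold_loss` and the parameters `pad_id` and `weight_floor` are illustrative; the per-token importance weight is approximated by the model's own detached probability of the reference token (so confident tokens are upweighted and unconfident ones downweighted), and a constant per-token reward is assumed, which is only one of the reward choices one might use.

```python
import torch
import torch.nn.functional as F

def gold_loss(logits, target_ids, pad_id=0, weight_floor=1e-3):
    """
    Sketch of an importance-weighted loss over demonstration (reference) tokens.

    logits:     (batch, seq_len, vocab) model scores at each position
    target_ids: (batch, seq_len) reference token ids
    """
    log_probs = F.log_softmax(logits, dim=-1)                              # (B, T, V)
    tok_logp = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)  # (B, T)

    # Importance weight ~ model's own probability of the reference token,
    # detached so it rescales the gradient rather than being optimized itself.
    # The floor keeps very unconfident tokens from being ignored entirely.
    weights = tok_logp.detach().exp().clamp(min=weight_floor)

    mask = (target_ids != pad_id).float()
    # Off-policy policy gradient with a constant reward: maximize the
    # importance-weighted log-likelihood of the reference.
    loss = -(weights * tok_logp * mask).sum() / mask.sum()
    return loss
```

Because training only reweights teacher-forced reference tokens, no online sampling from the model is needed, which is what avoids the optimization difficulties of on-policy RL mentioned above.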