The availability of large pre-trained models is changing the landscape of Machine Learning research and practice, moving from a training-from-scratch to a fine-tuning paradigm. While in some applications the goal is to "nudge" the pre-trained distribution towards preferred outputs, in others it is to steer it towards a different distribution over the sample space. Two main paradigms have emerged to tackle this challenge: Reward Maximization (RM) and, more recently, Distribution Matching (DM). RM applies standard Reinforcement Learning (RL) techniques, such as Policy Gradients, to gradually increase the reward signal. DM prescribes first making explicit the target distribution that the model is fine-tuned to approximate. Here we explore the theoretical connections between the two paradigms, and show that methods such as KL-control developed for RM can also be construed as belonging to DM. We further observe that while DM differs from RM, it can suffer from similar training difficulties, such as high gradient variance. We leverage connections between the two paradigms to import the concept of baseline into DM methods. We empirically validate the benefits of adding a baseline on an array of controllable language generation tasks such as constraining topic, sentiment, and gender distributions in texts sampled from a language model. We observe superior performance in terms of constraint satisfaction, stability, and sample efficiency.
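To make the baseline idea concrete, here is a minimal illustrative sketch (the notation $P$, $Z$, $\pi_\theta$, $B$, and the cross-entropy objective are our assumptions for illustration, not necessarily the exact estimator used in the paper). Suppose the target distribution is given by an unnormalized energy-based model $P(x)$ with partition function $Z$, so $p(x) = P(x)/Z$, and the model $\pi_\theta$ is fine-tuned to approximate $p$ by minimizing the cross-entropy $\mathrm{CE}(p, \pi_\theta) = -\mathbb{E}_{x \sim p}[\log \pi_\theta(x)]$, estimated with importance sampling from $\pi_\theta$ itself. Then, for any constant baseline $B$,
$$
\nabla_\theta \,\mathrm{CE}(p, \pi_\theta) \;=\; -\,\mathbb{E}_{x \sim \pi_\theta}\!\left[\left(\frac{P(x)}{Z\,\pi_\theta(x)} - B\right)\nabla_\theta \log \pi_\theta(x)\right],
$$
because $\mathbb{E}_{x \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(x)] = 0$. Subtracting a baseline such as $B = 1$ (the expected value of the importance weight $P(x)/(Z\,\pi_\theta(x))$ under $\pi_\theta$) therefore leaves the gradient estimator unbiased while it can reduce the variance of the per-sample weight, mirroring the role a baseline plays in RM-style Policy Gradients.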