Machine learning is shifting towards general-purpose pretrained generative models, trained in a self-supervised manner on large amounts of data, which can then be applied to solve a large number of tasks. However, due to their generic training methodology, these models often fail to meet some downstream requirements (e.g., hallucinations in abstractive summarization or style violations in code generation). This raises the important question of how to adapt pretrained generative models to meet all requirements without destroying their general capabilities ("catastrophic forgetting"). Recent work has proposed to solve this problem by representing task-specific requirements as energy-based models (EBMs) and approximating these EBMs using distributional policy gradients (DPG). Despite its effectiveness, this approach is limited to unconditional distributions. In this paper, we extend DPG to conditional tasks by proposing Conditional DPG (CDPG). We evaluate CDPG on four different control objectives across three tasks (translation, summarization and code generation) and two pretrained models (T5 and GPT-Neo). Our results show that fine-tuning using CDPG robustly moves these pretrained models closer towards meeting control objectives and -- in contrast with baseline approaches -- does not result in catastrophic forgetting.
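To make the idea sketched above concrete, the snippet below gives one possible reading of a CDPG-style update in PyTorch. It is an illustrative sketch under our own assumptions, not the authors' implementation: `policy.sample_with_logprobs`, `ebm_score`, and the single-batch estimate of the per-context normalizing constant Z(c) are hypothetical placeholders standing in for the model's sampling interface and the EBM defined by the control objective.

```python
import torch

def cdpg_step(policy, contexts, ebm_score, optimizer, num_samples=4):
    """Hedged sketch of one CDPG-style update (illustrative assumptions only).

    For each context c, the control objective defines an unnormalized EBM
    P(x|c); the target is p(x|c) = P(x|c) / Z(c). The policy pi_theta is
    nudged toward p via importance-weighted policy gradients, with Z(c)
    estimated from the same batch of samples.
    """
    loss = 0.0
    for c in contexts:
        # x ~ pi_theta(.|c); logprobs carry gradients w.r.t. theta
        # (sample_with_logprobs is a hypothetical helper, not a real API)
        samples, logprobs = policy.sample_with_logprobs(c, num_samples)
        # Unnormalized EBM scores P(x|c) for each sample (hypothetical scorer)
        scores = torch.tensor([ebm_score(c, x) for x in samples])
        # Batch estimate of the per-context partition function Z(c)
        z_hat = (scores / logprobs.exp()).mean().detach()
        # Importance weights p(x|c) / pi_theta(x|c); detached so gradients
        # flow only through log pi_theta
        weights = (scores / (z_hat * logprobs.exp())).detach()
        loss = loss - (weights * logprobs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The design choice worth noting is that the update does not maximize a reward: it reweights the policy's own samples toward a fixed EBM target built on the pretrained model, which is the mechanism the abstract credits for avoiding catastrophic forgetting. In practice, `logprobs.exp()` would underflow for long sequences, so a real implementation would work in log space.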