We introduce a new framework, Directional Stimulus Prompting, that uses a tunable language model (LM) to provide guidance for a black-box, frozen large language model (LLM) on downstream tasks. Unlike prior work that manually or automatically searches for an optimal prompt for each task, we train a policy LM to generate discrete tokens as a directional stimulus for each input, i.e., a hint or cue such as keywords of an article for summarization. The directional stimulus is then combined with the original input and fed into the LLM to guide its generation toward the desired target. The policy LM can be trained through 1) supervised learning on annotated data and 2) reinforcement learning from offline and online rewards, to explore directional stimuli that better align the LLM with human preferences. The framework is flexibly applicable to various LMs and tasks. To verify its effectiveness, we apply it to summarization and dialogue response generation. Experimental results demonstrate that it significantly improves the LLM's performance with a small amount of training data: a T5 (780M) policy LM trained on 2,000 samples from the CNN/Daily Mail dataset improves Codex (175B)'s ROUGE-Avg score by 9.0%, and merely 80 dialogues boost the combined score on the MultiWOZ dataset by 39.7%, achieving comparable or even better performance than some fully trained models. We have made our code publicly available.
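The sketch below illustrates the inference-time flow described above: the policy LM produces keyword tokens as the directional stimulus, which are combined with the original input and sent to the frozen LLM. It is a minimal sketch, assuming a Hugging Face t5-large checkpoint as the policy LM; the prompt template and the query_frozen_llm helper are illustrative placeholders rather than the paper's exact implementation, and the supervised/RL training of the policy LM is not shown.

```python
# Minimal sketch of Directional Stimulus Prompting for summarization.
# Assumptions: a Hugging Face "t5-large" checkpoint stands in for the tunable policy LM,
# and query_frozen_llm() is a hypothetical placeholder for the black-box LLM API (e.g., Codex).
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-large")
policy_lm = T5ForConditionalGeneration.from_pretrained("t5-large")


def generate_stimulus(article: str, max_stimulus_tokens: int = 32) -> str:
    """Policy LM step: generate discrete hint tokens (keywords) for one input article."""
    inputs = tokenizer("Extract keywords: " + article, return_tensors="pt",
                       truncation=True, max_length=512)
    output_ids = policy_lm.generate(**inputs, max_new_tokens=max_stimulus_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


def build_prompt(article: str, stimulus: str) -> str:
    """Combine the original input with the directional stimulus (template is illustrative)."""
    return (
        f"Article: {article}\n"
        f"Keywords: {stimulus}\n"
        "Summarize the article above in a few sentences, covering the listed keywords.\n"
        "Summary:"
    )


def query_frozen_llm(prompt: str) -> str:
    """Placeholder for the black-box frozen LLM call; wire this to your provider's completion API."""
    raise NotImplementedError("Connect to a frozen LLM such as Codex or GPT-3.")


def summarize(article: str) -> str:
    stimulus = generate_stimulus(article)     # hint/cue from the tunable policy LM
    prompt = build_prompt(article, stimulus)  # stimulus combined with the original input
    return query_frozen_llm(prompt)           # frozen LLM generates the guided summary
```

In the full framework, the same pipeline supplies the reward signal: the frozen LLM's output is scored (e.g., with ROUGE against references), and that score is used to update the policy LM via reinforcement learning.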