We introduce a new framework, Directional Stimulus Prompting, that uses a tuneable language model (LM) to provide guidance for a black-box, frozen large language model (LLM) on downstream tasks. Unlike prior work that manually or automatically searches for the optimal prompt for each task, we train a policy LM to generate discrete tokens as a ``directional stimulus'' for each input, i.e., a hint or cue such as keywords of an article for summarization. The directional stimulus is then combined with the original input and fed into the LLM to guide its generation toward the desired target. The policy LM can be trained through 1) supervised learning from annotated data and 2) reinforcement learning from offline and online rewards, to explore directional stimuli that better align the LLM's outputs with human preferences. This framework is flexibly applicable to various LMs and tasks. To verify its effectiveness, we apply our framework to summarization and dialogue response generation tasks. Experimental results demonstrate that it can significantly improve LLMs' performance with a small collection of training data: a T5 (780M) trained with 2,000 samples from the CNN/Daily Mail dataset improves Codex (175B)'s performance by 7.2% in ROUGE-Avg scores; 500 dialogues boost the combined score by 52.5%, achieving comparable or even better performance than fully trained models on the MultiWOZ dataset.
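To make the described pipeline concrete, below is a minimal, hedged sketch of the inference-time flow: a tuneable policy LM produces keyword-style hints (the directional stimulus), which are concatenated with the original input before querying the frozen black-box LLM. The model name, prompt template wording, and the `query_frozen_llm` placeholder are illustrative assumptions, not the paper's exact implementation.

\begin{verbatim}
# Sketch of the Directional Stimulus Prompting inference flow.
# Assumptions: model name, prompt template, and the placeholder
# `query_frozen_llm` are illustrative only.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

POLICY_MODEL = "t5-large"  # stand-in for the ~780M tuneable policy LM

tokenizer = AutoTokenizer.from_pretrained(POLICY_MODEL)
policy_lm = AutoModelForSeq2SeqLM.from_pretrained(POLICY_MODEL)


def generate_stimulus(article: str, max_new_tokens: int = 32) -> str:
    """Use the policy LM to produce keyword hints (the directional stimulus)."""
    inputs = tokenizer("extract keywords: " + article, return_tensors="pt",
                       truncation=True, max_length=512)
    output_ids = policy_lm.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


def build_prompt(article: str, stimulus: str) -> str:
    """Combine the original input with the stimulus before querying the LLM."""
    return (f"Article: {article}\n"
            f"Keywords: {stimulus}\n"
            f"Write a summary that covers the keywords above:")


def query_frozen_llm(prompt: str) -> str:
    """Placeholder for a call to a black-box LLM API (e.g., Codex)."""
    raise NotImplementedError("Plug in the LLM API call of your choice.")
\end{verbatim}

At training time, the same policy LM would first be fine-tuned on annotated (input, stimulus) pairs and then further optimized with reinforcement learning, using task metrics or preference scores on the LLM's outputs as the reward signal; that loop is omitted from the sketch above.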