Large Language Models (LLMs) are popular for their impressive abilities, but the need for model-specific fine-tuning or task-specific prompt engineering can hinder their generalization. We propose UPRISE (Universal Prompt Retrieval for Improving zero-Shot Evaluation), which tunes a lightweight and versatile retriever that automatically retrieves prompts for a given zero-shot task input. Specifically, we demonstrate universality in a cross-task and cross-model scenario: the retriever is tuned on a diverse set of tasks, but tested on unseen task types; we use a small frozen LLM, GPT-Neo-2.7B, for tuning the retriever, but test the retriever on different LLMs of much larger scales, such as BLOOM-7.1B, OPT-66B and GPT3-175B. Additionally, we show that UPRISE mitigates the hallucination problem in our experiments with ChatGPT, suggesting its potential to improve even the strongest LLMs. Our model and code are available at https://github.com/microsoft/LMOps.
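To make the pipeline concrete, below is a minimal sketch of UPRISE-style inference under stated assumptions: a tuned retriever encoder and a pre-encoded pool of prompts already exist, and the paths, the `prompt_pool` list, and the helper names (`encode`, `retrieve_prompts`, `zero_shot_with_retrieval`) are illustrative placeholders rather than the released API. The retrieved prompts are simply prepended to the zero-shot task input and passed to a frozen LLM (GPT-Neo-2.7B here, matching the tuning setup; larger models are drop-in replacements).

```python
# Minimal sketch of UPRISE-style zero-shot inference with a prompt retriever.
# Paths and the prompt pool below are hypothetical placeholders.
import numpy as np
import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# Hypothetical artifacts: a pool of candidate prompts and a tuned retriever checkpoint.
prompt_pool = [
    "Question: Is the sky blue? Answer: yes",
    "Premise: A man is sleeping. Hypothesis: A man is awake. Label: contradiction",
]
retriever = AutoModel.from_pretrained("path/to/tuned-retriever")           # placeholder path
retriever_tok = AutoTokenizer.from_pretrained("path/to/tuned-retriever")   # placeholder path

def encode(texts):
    """Mean-pooled, L2-normalized dense embeddings from the retriever."""
    batch = retriever_tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = retriever(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    emb = (hidden * mask).sum(1) / mask.sum(1)
    return torch.nn.functional.normalize(emb, dim=-1).numpy()

pool_emb = encode(prompt_pool)  # pre-compute prompt embeddings once

def retrieve_prompts(task_input, k=3):
    """Return the k prompts most similar to the zero-shot task input."""
    scores = pool_emb @ encode([task_input]).T  # cosine similarity (embeddings are normalized)
    top = np.argsort(-scores[:, 0])[:k]
    return [prompt_pool[i] for i in top]

# Any frozen LLM can consume the retrieved prompts; GPT-Neo-2.7B is used for
# tuning in the paper, and larger LLMs can be swapped in at test time.
llm_tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-2.7B")
llm = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-2.7B")

def zero_shot_with_retrieval(task_input, k=3, max_new_tokens=32):
    """Prepend retrieved prompts to the task input and generate with the frozen LLM."""
    context = "\n\n".join(retrieve_prompts(task_input, k) + [task_input])
    ids = llm_tok(context, return_tensors="pt").input_ids
    out = llm.generate(ids, max_new_tokens=max_new_tokens)
    return llm_tok.decode(out[0, ids.shape[1]:], skip_special_tokens=True)
```

The key design point illustrated here is that only the retriever is tuned: the LLM stays frozen and receives retrieved prompts purely through its input context, which is what allows the same retriever to transfer to much larger models at test time.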