By conditioning on natural language instructions, large language models (LLMs) have displayed impressive capabilities as general-purpose computers. However, task performance depends significantly on the quality of the prompt used to steer the model, and the most effective prompts have been handcrafted by humans. Inspired by classical program synthesis and the human approach to prompt engineering, we propose Automatic Prompt Engineer (APE) for automatic instruction generation and selection. In our method, we treat the instruction as the "program," optimized by searching over a pool of instruction candidates proposed by an LLM in order to maximize a chosen score function. To assess the quality of a selected instruction, we evaluate the zero-shot performance of another LLM following it. Experiments on 24 NLP tasks show that our automatically generated instructions outperform the prior LLM baseline by a large margin and achieve better or comparable performance to the instructions generated by human annotators on 19/24 tasks. We conduct extensive qualitative and quantitative analyses to explore the performance of APE. We show that APE-engineered prompts can be applied to steer models toward truthfulness and/or informativeness, as well as to improve few-shot learning performance by simply prepending them to standard in-context learning prompts. Please check out our webpage at https://sites.google.com/view/automatic-prompt-engineer.
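To make the propose-score-select loop concrete, the sketch below shows one way the search described above could be implemented. This is a minimal illustration under stated assumptions, not the paper's released implementation: `propose_llm` and `eval_llm` are hypothetical stand-ins for calls to any instruction-following LLM API, the meta-prompt wording is only indicative, and the score function shown is plain exact-match accuracy.

```python
# Minimal sketch of the APE search loop (hypothetical helpers, not the paper's code).
from typing import Callable, List, Tuple

def ape_select(
    demos: List[Tuple[str, str]],             # (input, output) demonstrations of the task
    eval_set: List[Tuple[str, str]],          # held-out pairs used to score candidates
    propose_llm: Callable[[str], List[str]],  # assumed: returns candidate instructions
    eval_llm: Callable[[str], str],           # assumed: zero-shot completion of a prompt
    n_candidates: int = 50,
) -> str:
    """Return the candidate instruction with the highest zero-shot score."""
    # 1. Proposal: ask an LLM to infer the instruction that explains the demos.
    demo_text = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in demos)
    meta_prompt = (
        "I gave a friend an instruction. Based on the instruction they produced "
        f"the following input-output pairs:\n\n{demo_text}\n\nThe instruction was:"
    )
    candidates = propose_llm(meta_prompt)[:n_candidates]

    # 2. Scoring: execute each instruction with another LLM and measure accuracy.
    def score(instruction: str) -> float:
        correct = 0
        for x, y in eval_set:
            pred = eval_llm(f"{instruction}\n\nInput: {x}\nOutput:")
            correct += int(pred.strip() == y.strip())
        return correct / len(eval_set)

    # 3. Selection: treat the instruction as the "program" and keep the best one.
    return max(candidates, key=score)
```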