Large language models (LLMs) show impressive abilities via few-shot prompting. Commercialized APIs such as OpenAI GPT-3 further increase their use in real-world language applications. However, the crucial problem of how to improve the reliability of GPT-3 remains under-explored. While reliability is a broad and vaguely defined term, we decompose it into four main facets that correspond to the existing framework of ML safety and are well recognized as important: generalizability, social biases, calibration, and factuality. Our core contribution is to establish simple and effective prompts that improve GPT-3's reliability so that it: 1) generalizes out-of-distribution, 2) balances demographic distributions and uses natural language instructions to reduce social biases, 3) calibrates its output probabilities, and 4) updates its factual knowledge and reasoning chains. With appropriate prompts, GPT-3 is more reliable than smaller-scale supervised models on all these facets. We release all processed datasets, evaluation scripts, and model predictions. Our systematic empirical study not only provides new insights into the reliability of prompting LLMs, but, more importantly, offers prompting strategies that help practitioners use LLMs like GPT-3 more reliably.
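To make the calibration facet concrete, below is a minimal sketch of how one might obtain a per-prediction confidence score from GPT-3's output probabilities, assuming the legacy openai Python client (pre-1.0 Completion interface) and the text-davinci-002 engine; the prompt, engine choice, and single-token answer format are illustrative assumptions, not the paper's exact setup.

```python
import math
import os

import openai  # legacy openai-python (<1.0) Completion interface

openai.api_key = os.environ["OPENAI_API_KEY"]


def answer_with_confidence(prompt: str, engine: str = "text-davinci-002"):
    """Query GPT-3 and return (answer_token, probability_of_that_token).

    The probability of the generated token serves as a simple confidence
    score that can later be checked for calibration (e.g., via expected
    calibration error over a labeled evaluation set).
    """
    response = openai.Completion.create(
        engine=engine,
        prompt=prompt,
        max_tokens=1,     # single-token answers, e.g., multiple-choice letters
        temperature=0.0,  # greedy decoding
        logprobs=5,       # also return the top-5 token log-probabilities
    )
    choice = response["choices"][0]
    token = choice["logprobs"]["tokens"][0]
    logprob = choice["logprobs"]["token_logprobs"][0]
    return token.strip(), math.exp(logprob)


# Example usage with an illustrative few-shot-style multiple-choice prompt.
prompt = (
    "Question: Which planet is known as the Red Planet?\n"
    "Options: (A) Venus (B) Mars (C) Jupiter\n"
    "Answer:"
)
answer, confidence = answer_with_confidence(prompt)
print(answer, confidence)
```

Comparing such confidence scores against empirical accuracy (e.g., with reliability diagrams) is one standard way to assess whether the model's output probabilities are well calibrated.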