Methods such as chain-of-thought prompting and self-consistency have pushed the frontier of language model reasoning performance with no additional training. To further improve performance, we propose a prompt ensembling method for large language models, which uses a small dataset to construct a set of few-shot prompts that together comprise a ``boosted prompt ensemble''. The few-shot examples for each prompt are chosen in a stepwise fashion to be ``hard'' examples on which the previous step's ensemble is uncertain. We show that this approach outperforms single-prompt output-space ensembles and bagged prompt-space ensembles on the GSM8k and AQuA datasets, among others. We propose both train-time and test-time versions of boosted prompting that require different levels of available annotation, and we conduct a detailed empirical study of our algorithm.
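To make the stepwise construction concrete, below is a minimal Python sketch of the train-time variant as the abstract describes it. The function names (`generate`, `format_prompt`, `boosted_prompt_ensemble`) and the specific uncertainty measure (one minus the majority-vote share over pooled self-consistency samples) are illustrative assumptions, not the paper's exact procedure.

```python
"""Minimal sketch of train-time boosted prompt ensemble construction.

Assumptions: `generate` is any caller-supplied function that samples k
chain-of-thought completions for a (prompt, question) pair and returns the
extracted final answers; "uncertain" is instantiated here as low majority-vote
agreement across the pooled samples of the current ensemble.
"""
from collections import Counter
from typing import Callable, List, Tuple

Example = Tuple[str, str]                        # (question, worked answer)
Generate = Callable[[str, str, int], List[str]]  # (prompt, question, k) -> final answers


def format_prompt(shots: List[Example]) -> str:
    # A few-shot prompt is just the concatenation of worked examples.
    return "\n\n".join(f"Q: {q}\nA: {a}" for q, a in shots)


def ensemble_uncertainty(prompts: List[str], question: str,
                         generate: Generate, k: int = 5) -> float:
    # Pool sampled answers from every prompt in the ensemble
    # (output-space ensembling via self-consistency-style voting).
    answers = [a for p in prompts for a in generate(p, question, k)]
    top_count = Counter(answers).most_common(1)[0][1]
    return 1.0 - top_count / len(answers)  # low agreement => "hard" example


def boosted_prompt_ensemble(train: List[Example], generate: Generate,
                            rounds: int = 3,
                            shots_per_prompt: int = 4) -> List[str]:
    # Seed the ensemble with a prompt built from a few training examples
    # (the choice of seed prompt is an assumption of this sketch).
    prompts = [format_prompt(train[:shots_per_prompt])]
    for _ in range(rounds - 1):
        # Score every training example by the current ensemble's uncertainty,
        # then build the next prompt from the hardest examples, paired with
        # their gold worked answers from the small labeled dataset.
        scored = sorted(
            train,
            key=lambda ex: ensemble_uncertainty(prompts, ex[0], generate),
            reverse=True,
        )
        prompts.append(format_prompt(scored[:shots_per_prompt]))
    return prompts
```

At test time, the returned prompts are used jointly: each prompt samples completions for a new question and the final answer is the majority vote over the pooled samples, as in `ensemble_uncertainty` above.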