Performing inference on hundreds of thousands of samples with large language models (LLMs) can be computationally and financially costly. We propose batch prompting, a simple alternative prompting approach that enables the LLM to run inference in batches, instead of one sample at a time. Our method reduces both token and time costs while retaining downstream performance. We theoretically demonstrate that under a few-shot in-context learning setting, the inference costs decrease almost inversely linearly with the number of samples in each batch. We extensively validate the effectiveness of batch prompting on ten datasets across commonsense QA, arithmetic reasoning, and NLI/NLU: batch prompting significantly reduces the LLM (Codex) inference token and time costs (up to $5\times$ with six samples per batch) while achieving better or comparable performance. Our analysis shows that the number of samples in each batch and the complexity of tasks affect its performance. Further, batch prompting can be applied across different LLMs and reasoning methods.
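To make the inverse-linear claim concrete, the following is a minimal cost sketch under stated assumptions; the symbols $N$ (number of test samples), $b$ (samples per batch), $c$ (tokens in the shared few-shot context, assumed roughly the same size whether exemplars are presented individually or grouped into batches), and $t$ (average tokens per sample, prompt plus response) are illustrative notation, not taken from the paper:
\[
\text{Cost}_{\text{standard}} = N\,(c + t), \qquad
\text{Cost}_{\text{batch}} = \frac{N}{b}\,(c + b\,t) = N\!\left(\frac{c}{b} + t\right),
\]
\[
\frac{\text{Cost}_{\text{batch}}}{\text{Cost}_{\text{standard}}} = \frac{c/b + t}{c + t} \;\approx\; \frac{1}{b} \quad \text{when } c \gg t,
\]
so the per-sample cost shrinks almost inversely with the batch size $b$; under these assumptions, $b = 6$ with a long few-shot context is consistent with the reported up-to-$5\times$ reduction.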