Large language models such as GPT-3 and PaLM have shown remarkable performance in few-shot learning. However, they still struggle with reasoning tasks such as the arithmetic benchmark GSM8K. Recent advances deliberately guide the language model to generate a chain of reasoning steps before producing the final answer, raising the problem-solving rate on GSM8K from 17.9% to 58.1%. In this paper, we propose a new approach, DiVeRSe (Diverse Verifier on Reasoning Step), to further advance the reasoning capability of these models. First, DiVeRSe explores different prompts to increase the diversity of reasoning paths. Second, DiVeRSe introduces a verifier that distinguishes good answers from bad ones, enabling better weighted voting. Finally, DiVeRSe verifies the correctness of each individual step rather than of all the steps as a whole. We conduct extensive experiments using the latest language model, code-davinci-002, and demonstrate that DiVeRSe achieves new state-of-the-art performance on six out of eight reasoning benchmarks (e.g., improving GSM8K from 74.4% to 83.2%), outperforming the PaLM model with 540B parameters.
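The difference between plain majority voting and verifier-weighted voting can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `(answer, score)` representation of a reasoning path, and the scores themselves are hypothetical, chosen only to show how a verifier's scores could change the voted answer.

```python
from collections import defaultdict

def majority_vote(paths):
    """Pick the answer reached by the most reasoning paths (scores ignored)."""
    counts = defaultdict(int)
    for answer, _score in paths:
        counts[answer] += 1
    return max(counts, key=counts.get)

def weighted_vote(paths):
    """Pick the answer with the largest total verifier score."""
    totals = defaultdict(float)
    for answer, score in paths:
        totals[answer] += score
    return max(totals, key=totals.get)

# Hypothetical sampled paths: (final answer, verifier score in [0, 1]).
# The scores are made up for illustration, not produced by a real verifier.
paths = [("18", 0.9), ("18", 0.8), ("20", 0.2), ("20", 0.3), ("20", 0.1)]

print(majority_vote(paths))  # "20": three of the five paths agree on it
print(weighted_vote(paths))  # "18": its two paths carry more verifier score
```

In this toy case the most frequent answer is supported only by low-scoring paths, so weighting by the verifier's score overturns the majority. How the scores are actually produced, and how step-level checks feed into them, is not specified in the abstract and is left out of the sketch.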