State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we introduce GSM8K, a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. We find that even the largest transformer models fail to achieve high test performance, despite the conceptual simplicity of this problem distribution. To increase performance, we propose training verifiers to judge the correctness of model completions. At test time, we generate many candidate solutions and select the one ranked highest by the verifier. We demonstrate that verification significantly improves performance on GSM8K, and we provide strong empirical evidence that verification scales more effectively with increased data than a finetuning baseline.
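To make the test-time selection procedure concrete, here is a minimal sketch of best-of-n verifier reranking: sample many candidate solutions from a generator, score each with a verifier, and return the highest-ranked one. The function names (`sample_solution`, `verifier_score`, `best_of_n`) and the placeholder bodies are illustrative assumptions, not the paper's implementation; in practice both roles are played by finetuned transformer language models.

```python
import random

# Hypothetical stand-ins for the finetuned generator and trained verifier.
def sample_solution(problem: str) -> str:
    # Placeholder: a real generator would sample a step-by-step solution.
    return f"candidate solution #{random.randint(0, 9999)} for: {problem}"

def verifier_score(problem: str, solution: str) -> float:
    # Placeholder: a real verifier would output an estimated probability
    # that the candidate solution is correct.
    return random.random()

def best_of_n(problem: str, n: int = 100) -> str:
    """Sample n candidate solutions and return the one the verifier ranks highest."""
    candidates = [sample_solution(problem) for _ in range(n)]
    return max(candidates, key=lambda s: verifier_score(problem, s))

if __name__ == "__main__":
    print(best_of_n("A sample grade school word problem.", n=10))
```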