Step-by-step reasoning approaches such as chain-of-thought (CoT) have proved to be very effective techniques for inducing reasoning capabilities in large language models. However, the success of CoT depends primarily on model size, and billion-parameter-scale models are often needed to get CoT to work. In this paper, we propose a knowledge distillation approach that leverages the step-by-step CoT reasoning capabilities of larger models and distils these reasoning abilities into smaller models. Our approach, Decompositional Distillation, learns a semantic decomposition of the original problem into a sequence of subproblems and uses it to train two models: a) a problem decomposer that learns to decompose a complex reasoning problem into a sequence of simpler subproblems, and b) a problem solver that uses the intermediate subproblems to solve the overall problem. On a multi-step math word problem dataset (GSM8K), our approach boosts the performance of GPT-2 variants by up to 35% compared to CoT distillation. We show that, using our approach, it is possible to train a GPT-2-large model (775M) that outperforms a 10X larger GPT-3 (6B) model trained with CoT reasoning. Finally, we demonstrate that our problem decomposition can also be used as an alternative to CoT prompting, boosting GPT-3 performance by 40% compared to CoT prompts.
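To illustrate the two-model pipeline described above, the following is a minimal inference sketch in which a distilled problem decomposer first produces sub-questions and a distilled problem solver then conditions on them to answer the original question. The checkpoint names ("decomposer-gpt2-large", "solver-gpt2-large") and the prompt formats are illustrative assumptions, not released artifacts of the paper.

```python
# Two-stage inference sketch: decomposer -> solver.
# Checkpoint names and prompt templates are hypothetical.
from transformers import AutoModelForCausalLM, AutoTokenizer

def generate(model, tokenizer, prompt, max_new_tokens=128):
    """Greedy decoding helper shared by both stages."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Return only the newly generated continuation.
    return tokenizer.decode(
        output_ids[0][inputs["input_ids"].shape[1]:],
        skip_special_tokens=True,
    )

# Hypothetical fine-tuned checkpoints for the two distilled models.
dec_tok = AutoTokenizer.from_pretrained("decomposer-gpt2-large")
dec_model = AutoModelForCausalLM.from_pretrained("decomposer-gpt2-large")
sol_tok = AutoTokenizer.from_pretrained("solver-gpt2-large")
sol_model = AutoModelForCausalLM.from_pretrained("solver-gpt2-large")

question = (
    "Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did she sell altogether?"
)

# Stage 1: decompose the word problem into simpler sub-questions.
subquestions = generate(dec_model, dec_tok,
                        f"Problem: {question}\nSub-questions:")

# Stage 2: the solver uses the sub-questions to produce the final answer.
answer = generate(sol_model, sol_tok,
                  f"Problem: {question}\nSub-questions: {subquestions}\nAnswer:")
print(answer)
```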