Distilling the reasoning capabilities of a large language model (LLM) into a smaller student model typically requires training on substantial amounts of reasoning data. However, distillation over lengthy sequences composed of prompt (P), chain-of-thought (CoT), and answer (A) segments makes the process computationally expensive. In this work, we investigate how the allocation of supervision across the P, CoT, and A segments affects student performance. Our analysis shows that selective knowledge distillation over only the CoT tokens can be effective when the CoT subsumes the information in the prompt and answer. Building on this insight, we introduce a truncation protocol that quantifies the computation-quality tradeoff as a function of sequence length. We find that training on only the first $50\%$ of the tokens in every training sequence retains, on average, $\approx 94\%$ of full-sequence performance on math benchmarks while reducing training time, memory usage, and FLOPs by roughly $50\%$ each. These findings suggest that reasoning distillation benefits from prioritizing early reasoning tokens, and they provide a simple lever for trading computation against quality. Code is available at https://github.com/weiruichen01/distilling-the-essence.
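To make the two ideas above concrete, the following is a minimal sketch of segment-selective distillation combined with the truncation lever. It is not the paper's implementation: the function name `distill_loss`, the `segment_ids` encoding (0 = P, 1 = CoT, 2 = A), and the `keep_frac` parameter are illustrative assumptions, and it uses a standard token-level forward-KL objective between teacher and student.

```python
# Illustrative sketch (not the released code): supervise only selected
# segments (e.g. CoT tokens) and only the first keep_frac of each sequence.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, segment_ids,
                 keep_segments=(1,), keep_frac=0.5, temperature=1.0):
    """Forward-KL distillation restricted to chosen segments and to the
    first `keep_frac` fraction of tokens in each sequence.

    student_logits, teacher_logits: [B, T, V]
    segment_ids: [B, T] with 0 = prompt, 1 = CoT, 2 = answer (assumed encoding)
    """
    B, T, V = student_logits.shape

    # Keep supervision only on the selected segments (e.g. the CoT tokens).
    mask = torch.zeros(B, T, dtype=torch.bool, device=student_logits.device)
    for seg in keep_segments:
        mask |= segment_ids == seg

    # Truncation protocol: drop supervision beyond the first keep_frac of tokens.
    cutoff = int(T * keep_frac)
    mask[:, cutoff:] = False

    # Token-level KL(teacher || student) at the supervised positions.
    s_logp = F.log_softmax(student_logits / temperature, dim=-1)
    t_prob = F.softmax(teacher_logits / temperature, dim=-1)
    kl = (t_prob * (t_prob.clamp_min(1e-9).log() - s_logp)).sum(-1)  # [B, T]

    return (kl * mask).sum() / mask.sum().clamp_min(1)
```

Note that masking alone only changes where the loss is applied; to realize the reported savings in time, memory, and FLOPs, the input sequences themselves would be truncated to their first $50\%$ of tokens before the forward pass.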