Large Language Models (LLMs) have made significant strides in problem-solving by incorporating reasoning processes. However, this enhanced reasoning capability increases the number of output tokens during inference, leading to higher computational costs. To address this challenge, we propose TwT (Thinking without Tokens), a method that reduces inference-time costs through habitual reasoning distillation with multi-teacher guidance, while maintaining high performance. Our approach introduces a Habitual Reasoning Distillation method, which internalizes explicit reasoning into the model's habitual behavior through a Teacher-Guided compression strategy inspired by human cognition. Additionally, we propose Dual-Criteria Rejection Sampling (DCRS), a technique that generates a high-quality and diverse distillation dataset using multiple teacher models, making our method suitable for unsupervised scenarios. Experimental results demonstrate that TwT effectively reduces inference costs while preserving superior performance, achieving up to a 13.6% improvement in accuracy with fewer output tokens than other distillation methods. TwT thus offers a highly practical solution for efficient LLM deployment.
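To make the DCRS idea concrete, the following is a minimal sketch of dual-criteria rejection sampling as described above: candidate outputs are drawn from multiple teacher models, and a candidate is kept only if it passes both a quality criterion and a diversity criterion relative to the already-accepted pool. The function name and the two predicate callbacks (`quality_ok`, `is_diverse`) are illustrative assumptions, not the paper's actual implementation.

```python
def dual_criteria_rejection_sampling(question, teachers, quality_ok, is_diverse, n_samples=4):
    """Hypothetical sketch of DCRS.

    teachers   -- callables mapping a question to a candidate response
    quality_ok -- criterion 1: does the candidate meet a quality bar?
    is_diverse -- criterion 2: does it add diversity to the accepted pool?
    """
    accepted = []
    for teacher in teachers:
        for _ in range(n_samples):
            candidate = teacher(question)
            # Criterion 1: reject low-quality candidates (e.g., a
            # correctness proxy such as agreement across teachers).
            if not quality_ok(question, candidate):
                continue
            # Criterion 2: reject candidates too similar to ones
            # already accepted, to keep the distillation set diverse.
            if is_diverse(candidate, accepted):
                accepted.append(candidate)
    return accepted
```

In practice the two predicates would be model-based scorers; here they are left as parameters so the control flow of the sampling loop stays visible.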