With the rapid advancement of Large Language Models (LLMs), the Chain-of-Thought (CoT) component has become central to complex reasoning tasks. However, in conventional Supervised Fine-Tuning (SFT), the model may allocate a disproportionate share of attention to long CoT sequences. This reduces focus on the much shorter but essential Key portion, the final answer, whose correctness directly determines task success and evaluation quality. To address this limitation, we propose SFTKey, a two-stage training scheme. In the first stage, conventional SFT is applied to ensure proper output formatting; in the second stage, only the Key portion is fine-tuned to improve answer accuracy. Extensive experiments across multiple benchmarks and model families demonstrate that SFTKey achieves an average accuracy improvement exceeding 5\% over conventional SFT, while preserving the ability to generate correctly formatted outputs. Overall, this study advances LLM fine-tuning by explicitly balancing CoT learning with additional optimization on answer-relevant tokens.
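To make the second stage concrete, the sketch below illustrates one common way to restrict the training signal to the Key (answer) tokens: masking all non-answer positions in the labels so they are excluded from the cross-entropy loss. This is a minimal illustration under that assumption, not the paper's exact implementation; the names key_only_labels, stage2_loss, and key_start are hypothetical.

\begin{verbatim}
# Minimal sketch of a Key-only (stage-2) loss, assuming the usual convention
# that label value -100 marks tokens ignored by the cross-entropy loss.
# key_start is a hypothetical index of the first answer token.
import torch
import torch.nn.functional as F

def key_only_labels(input_ids: torch.Tensor, key_start: int) -> torch.Tensor:
    """Copy input_ids as labels, masking everything before the Key span."""
    labels = input_ids.clone()
    labels[:, :key_start] = -100  # prompt and CoT tokens contribute no gradient
    return labels

def stage2_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy restricted to unmasked (Key) positions."""
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )
\end{verbatim}

In this reading, stage one trains on the full sequence as in conventional SFT, while stage two reuses the same data but applies the masked loss above so that gradient updates target only the answer-relevant tokens.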