A practical approach to activating long chain-of-thought reasoning in pre-trained large language models is supervised fine-tuning on instruction datasets synthesized by strong Large Reasoning Models such as DeepSeek-R1, offering a cost-effective alternative to reinforcement learning. However, large-scale instruction sets with more than 100k samples incur significant training overhead, while effective strategies for automatic long-CoT instruction selection remain underexplored. In this work, we propose Select2Reason, a novel and efficient instruction-tuning data selection framework for long-CoT reasoning. From the perspective of the emergence of rethinking behaviors such as self-correction and backtracking, we investigate common metrics that may determine the quality of long-CoT reasoning instructions. Select2Reason leverages a quantifier to estimate question difficulty and jointly incorporates a heuristic based on reasoning-trace length through a weighted ranking scheme that prioritizes high-utility examples. Empirical results on OpenR1-Math-220k demonstrate that fine-tuning an LLM on only 10% of the data selected by Select2Reason achieves performance competitive with or superior to full-data tuning and the open-source baseline OpenR1-Qwen-7B across three competition-level and six comprehensive mathematical benchmarks. Further experiments highlight its scalability across data sizes, its efficiency during inference, and its adaptability to other instruction pools at minimal cost.
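The abstract describes ranking instructions by a weighted combination of an estimated question difficulty and a reasoning-trace length heuristic, then keeping a fixed fraction (e.g. 10%) of the pool. The sketch below illustrates that weighted-ranking idea only; the field names, the min-max normalization, and the weight `alpha` are assumptions for illustration and are not taken from the paper.

```python
# Minimal sketch of a weighted ranking for long-CoT instruction selection.
# Assumptions: each example carries a precomputed `difficulty` score from an
# external quantifier and a `reasoning_trace` string; `alpha` balances the two
# signals. The paper's actual quantifier and weighting may differ.
from typing import Dict, List


def select_top_fraction(pool: List[Dict], alpha: float = 0.5, ratio: float = 0.1) -> List[Dict]:
    """Rank examples by alpha * difficulty + (1 - alpha) * trace length, keep top `ratio`."""
    def normalize(values: List[float]) -> List[float]:
        lo, hi = min(values), max(values)
        return [(v - lo) / (hi - lo + 1e-9) for v in values]

    difficulty = normalize([ex["difficulty"] for ex in pool])
    trace_len = normalize([float(len(ex["reasoning_trace"].split())) for ex in pool])

    scored = sorted(
        zip(pool, difficulty, trace_len),
        key=lambda t: alpha * t[1] + (1 - alpha) * t[2],
        reverse=True,
    )
    k = max(1, int(len(pool) * ratio))
    return [ex for ex, _, _ in scored[:k]]
```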