This study presents the first comprehensive evaluation of thinking budget mechanisms in medical reasoning tasks, revealing fundamental scaling laws between computational resources and reasoning quality. We systematically evaluated two major model families, Qwen3 (1.7B to 235B parameters) and DeepSeek-R1 (1.5B to 70B parameters), across 15 medical datasets spanning diverse specialties and difficulty levels. Through controlled experiments with thinking budgets ranging from zero to unlimited tokens, we establish logarithmic scaling relationships in which accuracy improves predictably with both thinking budget and model size. Our findings identify three distinct efficiency regimes: high-efficiency (0 to 256 tokens), suitable for real-time applications; balanced (256 to 512 tokens), offering optimal cost-performance tradeoffs for routine clinical support; and high-accuracy (above 512 tokens), justified only for critical diagnostic tasks. Notably, smaller models benefit disproportionately from extended thinking, improving by 15 to 20% compared with 5 to 10% for larger models, which suggests a complementary relationship in which thinking budget offers greater relative benefit to capacity-constrained models. Domain-specific patterns emerge clearly, with neurology and gastroenterology requiring significantly deeper reasoning than cardiovascular or respiratory medicine. The consistency between Qwen3's native thinking budget API and our proposed truncation method for DeepSeek-R1 validates the generalizability of thinking budget concepts across architectures. These results establish thinking budget control as a critical mechanism for optimizing medical AI systems, enabling dynamic resource allocation aligned with clinical needs while maintaining the transparency essential for healthcare deployment.
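As an illustration of the three efficiency regimes named in the abstract, the following minimal sketch maps a thinking budget to its regime. The function name `efficiency_regime`, the use of `None` to stand for an unlimited budget, and the choice to treat the upper bounds 256 and 512 as inclusive are all assumptions made for this example; the abstract gives only the ranges and does not say how boundary values are assigned.

```python
from typing import Optional

# Regime boundaries as stated in the abstract (in thinking tokens).
# Treating each upper bound as inclusive is an assumption of this sketch.
HIGH_EFFICIENCY_MAX = 256
BALANCED_MAX = 512


def efficiency_regime(budget_tokens: Optional[int]) -> str:
    """Return the efficiency regime for a thinking budget.

    budget_tokens: number of thinking tokens allowed, or None for an
    unlimited budget (a hypothetical convention for this example).
    """
    if budget_tokens is None:
        # Unlimited thinking falls above every finite threshold.
        return "high-accuracy"
    if budget_tokens < 0:
        raise ValueError("thinking budget must be non-negative")
    if budget_tokens <= HIGH_EFFICIENCY_MAX:
        return "high-efficiency"
    if budget_tokens <= BALANCED_MAX:
        return "balanced"
    return "high-accuracy"


if __name__ == "__main__":
    for budget in (0, 128, 384, 1024, None):
        print(budget, "->", efficiency_regime(budget))
```

A deployment could use such a lookup to pick a budget per request type, for example a small budget for real-time triage prompts and a larger one for critical diagnostic queries, though the mapping from clinical task to regime would have to be set by the system designer.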