Math reasoning has become the poster child of progress in large language models (LLMs), with new models rapidly surpassing human-level performance on benchmarks like MATH and AIME. But as math leaderboards improve week by week, it is worth asking: do these gains reflect broader problem-solving ability or just narrow overfitting? To answer this question, we evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks, including math, scientific QA, agent planning, coding, and standard instruction-following. Surprisingly, we find that most models that succeed at math fail to transfer their gains to other domains. To rigorously study this phenomenon, we conduct controlled experiments on Qwen3-14B models using math-only data but different tuning methods. We find that models tuned with reinforcement learning (RL) generalize well across domains, whereas models tuned with supervised fine-tuning (SFT) often forget general capabilities. Latent-space representation and token-space distribution shift analyses reveal that SFT induces substantial representation and output drift, while RL preserves general-domain structure. Our results suggest a need to rethink standard post-training recipes, particularly the reliance on SFT-distilled data for advancing reasoning models.
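To make the drift analyses concrete, the following is a minimal sketch of one way such measurements can be implemented with HuggingFace transformers: per-token KL divergence between the base and tuned models' next-token distributions (token-space shift) and cosine similarity of their last-layer hidden states (latent-space drift) on general-domain probe prompts. The model paths, probe prompts, and aggregation choices here are illustrative assumptions, not the paper's exact setup.

```python
# Sketch (assumed setup, not the authors' code): compare a base checkpoint with a
# math-tuned checkpoint on general-domain prompts to quantify distribution shift.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen3-14B"        # assumed base checkpoint
TUNED = "path/to/tuned-model"  # hypothetical SFT- or RL-tuned checkpoint

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")
tuned = AutoModelForCausalLM.from_pretrained(TUNED, torch_dtype=torch.bfloat16, device_map="auto")

# Hypothetical general-domain probe prompts (the paper's evaluation suite is far broader).
prompts = ["Explain how a hash table handles collisions."]

kl_scores, cos_scores = [], []
for p in prompts:
    ids_b = tok(p, return_tensors="pt").to(base.device)
    ids_t = tok(p, return_tensors="pt").to(tuned.device)
    with torch.no_grad():
        out_b = base(**ids_b, output_hidden_states=True)
        out_t = tuned(**ids_t, output_hidden_states=True)

    # Token-space shift: mean per-position KL(base || tuned) over next-token distributions.
    logp_b = F.log_softmax(out_b.logits.float().cpu(), dim=-1)
    logp_t = F.log_softmax(out_t.logits.float().cpu(), dim=-1)
    kl = (logp_b.exp() * (logp_b - logp_t)).sum(dim=-1).mean()
    kl_scores.append(kl.item())

    # Latent-space drift: cosine similarity of last-layer hidden states, averaged over positions.
    h_b = out_b.hidden_states[-1].float().cpu()
    h_t = out_t.hidden_states[-1].float().cpu()
    cos_scores.append(F.cosine_similarity(h_b, h_t, dim=-1).mean().item())

print(f"mean KL(base || tuned): {sum(kl_scores) / len(kl_scores):.4f}")
print(f"mean hidden-state cosine similarity: {sum(cos_scores) / len(cos_scores):.4f}")
```

Under this kind of probe, higher KL and lower cosine similarity on general-domain inputs would indicate the output and representation drift attributed to SFT, whereas an RL-tuned model would be expected to stay close to the base model on both measures.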