Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

Math reasoning has become the poster child of progress in large language models (LLMs), with new models rapidly surpassing human-level performance on benchmarks like MATH and AIME. But as math leaderboards improve week by week, it is worth asking: do these gains reflect broader problem-solving ability or just narrow overfitting? To answer this question, we evaluate over 20 open-weight reasoning-tuned models across a broad suite of tasks, including math, scientific QA, agent planning, coding, and standard instruction-following. Surprisingly, we find that most models that succeed in math fail to transfer their gains to other domains. To study this phenomenon rigorously, we conduct controlled experiments on Qwen3-14B models, training on math-only data while varying the tuning method. We find that models tuned with reinforcement learning (RL) generalize well across domains, while models tuned with supervised fine-tuning (SFT) often forget general capabilities. Latent-space representation and token-space distribution shift analyses reveal that SFT induces substantial representation and output drift, while RL preserves general-domain structure. Our results suggest a need to rethink standard post-training recipes, particularly the reliance on SFT-distilled data for advancing reasoning models.
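To make the idea of a token-space distribution shift concrete, here is a minimal sketch of one common way to quantify it: the average KL divergence between the next-token distributions of a tuned model and its base model at each position. The abstract does not say which metric the authors use, so the choice of KL divergence, the function names (`softmax`, `kl_divergence`, `mean_token_kl`), and the toy logits are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Convert a list of raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def mean_token_kl(base_logits, tuned_logits):
    """Average KL(tuned || base) over token positions.

    Each argument is a list of per-position logit vectors (hypothetical
    inputs; in practice they would come from running both models on the
    same prompt and continuation).
    """
    kls = [
        kl_divergence(softmax(t), softmax(b))
        for b, t in zip(base_logits, tuned_logits)
    ]
    return sum(kls) / len(kls)


# Toy logits over a 3-token vocabulary at 2 positions (made-up numbers).
base = [[2.0, 1.0, 0.1], [0.5, 0.5, 0.5]]
small_drift = [[2.1, 1.0, 0.1], [0.5, 0.6, 0.5]]
large_drift = [[0.1, 1.0, 2.0], [3.0, 0.5, 0.5]]

print("small drift:", mean_token_kl(base, small_drift))
print("large drift:", mean_token_kl(base, large_drift))
```

Under this measure, a tuned model whose output distributions stay close to the base model scores near zero, while one whose distributions have moved substantially scores higher. This is the kind of contrast the abstract draws between RL and SFT.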

