Large language models (LLMs) often solve challenging math exercises yet fail to apply the underlying concept correctly when a problem requires genuine understanding. Popular Reinforcement Learning with Verifiable Rewards (RLVR) pipelines reinforce final answers but provide little fine-grained conceptual signal, so models improve at pattern reuse rather than conceptual application. We introduce CORE (Concept-Oriented REinforcement), an RL training framework that turns explicit concepts into a controllable supervision signal. Starting from a high-quality, low-contamination textbook resource that links verifiable exercises to concise concept descriptions, we run a sanity probe showing that LLMs can restate definitions yet fail concept-linked quizzes, which quantifies the conceptual reasoning gap. CORE then (i) synthesizes concept-aligned quizzes, (ii) injects brief concept snippets during rollouts to elicit concept-primed trajectories, and (iii) reinforces conceptual reasoning via trajectory replacement after group failures, a lightweight forward-KL constraint that aligns the unguided policy with the concept-primed policy, or standard GRPO applied directly to concept-aligned quizzes. Across several models, CORE delivers consistent gains over vanilla and SFT baselines on both in-domain concept-exercise suites and diverse out-of-domain math benchmarks. CORE unifies direct training on concept-aligned quizzes with concept-injected rollouts under outcome regularization, providing fine-grained conceptual supervision that bridges problem-solving competence and genuine conceptual reasoning while remaining algorithm- and verifier-agnostic.
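As a hedged illustration of the forward-KL constraint named above, one way such an objective could be written (the symbols $\pi_\theta$, $c$, $\beta$, and $\mathcal{L}_{\text{GRPO}}$ are our own illustrative notation, not taken from the paper) is

$$
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\text{GRPO}}(\theta) \;+\; \beta \, \mathbb{E}_{x}\!\left[\, D_{\mathrm{KL}}\!\big(\pi_\theta(\cdot \mid x, c) \,\big\|\, \pi_\theta(\cdot \mid x)\big) \right],
$$

where $x$ is a concept-aligned quiz, $c$ its concept snippet, and the concept-primed distribution $\pi_\theta(\cdot \mid x, c)$ serves as the target of the forward KL (typically with gradients stopped through it), so that the unguided policy $\pi_\theta(\cdot \mid x)$ is pulled toward concept-primed behavior.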
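Below is a minimal runnable sketch of the trajectory-replacement step under our own assumptions about the interface (the class `Policy` and the functions `verify` and `build_training_group` are hypothetical stand-ins, not CORE's actual implementation): when every rollout in a GRPO group fails verification, one failed trajectory is swapped for a successful concept-primed rollout so the group still carries a learning signal.

```python
import random
from typing import List, Tuple

# Minimal, self-contained sketch of "trajectory replacement after group
# failures" as we read it from the abstract. All names here (Policy, verify,
# build_training_group) are illustrative placeholders, not CORE's actual API.

class Policy:
    """Toy stand-in for an LLM policy; returns a random 'answer' string."""
    def generate(self, prompt: str) -> str:
        return random.choice(["answer: 4", "answer: 5"])

def verify(target: str, rollout: str) -> int:
    """Toy verifiable reward: 1 if the rollout contains the target answer."""
    return int(target in rollout)

def build_training_group(
    policy: Policy,
    prompt: str,
    concept: str,
    target: str,
    group_size: int = 8,
) -> Tuple[List[str], List[int]]:
    """Sample a GRPO group; on total failure, swap in one concept-primed rollout."""
    rollouts = [policy.generate(prompt) for _ in range(group_size)]
    rewards = [verify(target, r) for r in rollouts]

    if max(rewards) == 0:
        # Group failure: every reward is 0, so group-relative advantages all
        # vanish and the update carries no signal. Re-sample once with the
        # concept snippet prepended to the prompt (a concept-primed rollout).
        primed = policy.generate(f"{concept}\n\n{prompt}")
        if verify(target, primed):
            rollouts[0], rewards[0] = primed, 1  # replace one failed trajectory

    return rollouts, rewards

if __name__ == "__main__":
    rollouts, rewards = build_training_group(
        Policy(),
        prompt="What is 2 + 2?",
        concept="Addition combines two quantities into their total.",
        target="4",
    )
    print(rewards)
```

The design intuition, as we read it, is that an all-failure group would otherwise contribute zero group-relative advantage; substituting a verified concept-primed trajectory restores a non-degenerate gradient while keeping the reward strictly outcome-based.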