Recent studies have shown that large language models (LLMs), especially smaller ones, often lack robustness in grade school math (GSM) reasoning. In particular, they tend to suffer performance drops under distribution shifts, such as changes to numerical or nominal variables, or the insertion of distracting clauses. One possible strategy to address this is to generate synthetic data that further "instantiates" reasoning problems with potential variations. In this work, we instead focus on the strategy of "abstracting" reasoning problems. This not only helps counteract distribution shifts but also facilitates the connection to symbolic tools for deriving solutions. Focusing on GSM, we find that this abstraction process is better acquired through reinforcement learning (RL) than through supervised fine-tuning alone, which often fails to produce faithful abstractions. Our method, AbstRaL -- which promotes abstract reasoning in LLMs using RL on granular abstraction data -- significantly mitigates performance degradation on recent GSM perturbation benchmarks. Moreover, improving GSM robustness via AbstRaL is shown to implicitly benefit LLMs' capabilities on out-of-distribution (OOD) mathematical and general reasoning tasks, indicating that abstract thinking broadly enables better generalizability.