We systematically evaluate Parameter-Efficient Fine-Tuning (PEFT) methods under the paradigm of Reinforcement Learning with Verifiable Rewards (RLVR). RLVR incentivizes language models to improve their reasoning capabilities through verifiable feedback; however, while methods like LoRA are commonly used, the optimal PEFT architecture for RLVR remains an open question. In this work, we conduct the first comprehensive evaluation of more than 12 PEFT methods on the DeepSeek-R1-Distill model family across mathematical reasoning benchmarks. Our empirical results challenge the default adoption of standard LoRA through three main findings. First, we demonstrate that structural variants such as DoRA, AdaLoRA, and MiSS consistently outperform LoRA. Second, we uncover a spectral collapse phenomenon in SVD-informed initialization strategies (\textit{e.g.,} PiSSA, MiLoRA), attributing their failure to a fundamental misalignment between principal-component updates and RL optimization. Third, we show that extreme parameter reduction (\textit{e.g.,} VeRA, rank-1 LoRA) severely bottlenecks reasoning capacity. We further validate these findings through ablation studies and scaling experiments. This work provides practical guidance for choosing PEFT methods under RLVR and advocates for deeper exploration of parameter-efficient RL methods.