Existing reinforcement learning (RL)-based post-training methods for large language models have advanced rapidly, yet their design has largely been guided by heuristics rather than systematic theoretical principles. This gap limits our understanding of the properties of the gradient estimators and the associated optimization algorithms, thereby constraining opportunities to improve training stability and overall performance. In this work, we provide a unified theoretical framework that characterizes the statistical properties of commonly used policy-gradient estimators under mild assumptions. Our analysis establishes unbiasedness, derives exact variance expressions, and yields an upper bound on the optimization loss that enables principled reasoning about learning dynamics. Building on these results, we prove convergence guarantees and derive an adaptive learning-rate schedule governed by the signal-to-noise ratio (SNR) of the gradient estimates. We further show that the variance-optimal baseline is a gradient-weighted estimator, offering a new principle for variance reduction that improves stability beyond existing methods. These insights motivate Optimal Baseline and Learning-Rate Policy Optimization (OBLR-PO), an algorithm that jointly adapts learning rates and baselines in a theoretically grounded manner. Experiments on Qwen3-4B-Base and Qwen3-8B-Base demonstrate consistent gains over existing policy optimization methods, validating that our theoretical contributions translate into practical improvements in large-scale post-training.
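As a concrete illustration of the gradient-weighted form (a minimal sketch using the classical single-sample score-function estimator for a prompt $x$ and sampled response $y$; the exact estimator analyzed in the paper may differ), the scalar baseline $b$ that minimizes the variance of $\hat{g} = \nabla_\theta \log \pi_\theta(y \mid x)\,\big(R(x, y) - b\big)$ is the gradient-norm-weighted average of rewards,
\[
b^{\star} \;=\; \frac{\mathbb{E}\big[\|\nabla_\theta \log \pi_\theta(y \mid x)\|^{2}\, R(x, y)\big]}{\mathbb{E}\big[\|\nabla_\theta \log \pi_\theta(y \mid x)\|^{2}\big]},
\]
rather than the plain mean reward $\mathbb{E}[R(x, y)]$ that standard baselines approximate.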