We present a unified framework for Large Language Model (LLM) fine-tuning that integrates Imitation Learning and Reinforcement Learning. By analyzing the gradient of a composite objective combining trajectory-level KL divergence with task rewards, we derive a natural decomposition into two components: (1) an analytically computable Dense Gradient for token-level imitation, and (2) a Monte Carlo-estimated Sparse Gradient for long-horizon reward optimization. The Dense Gradient admits a closed-form logit-level formula, enabling efficient GPU implementation.
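For concreteness, the following is a minimal sketch of the composite objective and gradient split described above, with $\pi_\theta$ denoting the fine-tuned policy, $\pi_{\mathrm{ref}}$ the imitation target, $R$ the task reward, and $\beta$ a weighting coefficient (notation assumed here for illustration, not fixed by the abstract):
$$
\mathcal{J}(\theta) \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\big[R(\tau)\big] \;-\; \beta\,\mathrm{KL}\!\big(\pi_\theta \,\big\|\, \pi_{\mathrm{ref}}\big),
\qquad
\nabla_\theta \mathcal{J} \;=\;
\underbrace{-\,\beta\,\nabla_\theta\,\mathrm{KL}\!\big(\pi_\theta \,\big\|\, \pi_{\mathrm{ref}}\big)}_{\text{Dense Gradient (analytic, token-level)}}
\;+\;
\underbrace{\nabla_\theta\,\mathbb{E}_{\tau \sim \pi_\theta}\!\big[R(\tau)\big]}_{\text{Sparse Gradient (Monte Carlo)}}.
$$
The split itself is just linearity of the gradient; the substantive claims summarized above are that the KL term admits a closed-form, per-token expression over the logits, while the reward term is estimated from sampled trajectories.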