A Note on Hybrid Online Reinforcement and Imitation Learning for LLMs: Formulations and Algorithms

Yingru Li, Ziniu Li, Jiacai Liu

We present a unified framework for Large Language Model (LLM) fine-tuning that integrates Imitation Learning and Reinforcement Learning. By analyzing the gradient of a composite objective combining trajectory-level KL divergence with task rewards, we derive a natural decomposition into two components: (1) an analytically computable Dense Gradient for token-level imitation, and (2) a Monte Carlo estimated Sparse Gradient for long-horizon reward optimization. The Dense Gradient admits a closed-form logit-level formula, enabling efficient GPU implementation.
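The abstract does not give the closed-form logit-level formula, so the following is only a minimal sketch of what such a formula can look like for a single token position. It assumes the per-token imitation term is the reverse KL divergence KL(p || q) between a student distribution p = softmax(z) and a fixed teacher distribution q. The direction of the KL and all function names here are assumptions for illustration, not the paper's notation. Under that assumption, the standard softmax identity gives dKL/dz_i = p_i (log p_i - log q_i - KL), and the sketch checks this against central finite differences. The Monte Carlo estimated Sparse Gradient is not illustrated.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_logits, teacher_probs):
    """KL(p || q) with p = softmax(student_logits) and q = teacher_probs."""
    p = softmax(student_logits)
    return sum(pi * (math.log(pi) - math.log(qi))
               for pi, qi in zip(p, teacher_probs))


def reverse_kl_logit_grad(student_logits, teacher_probs):
    """Closed form: dKL/dz_i = p_i * (log p_i - log q_i - KL)."""
    p = softmax(student_logits)
    log_ratio = [math.log(pi) - math.log(qi)
                 for pi, qi in zip(p, teacher_probs)]
    kl = sum(pi * r for pi, r in zip(p, log_ratio))
    return [pi * (r - kl) for pi, r in zip(p, log_ratio)]


def finite_difference_grad(student_logits, teacher_probs, eps=1e-6):
    """Central finite differences, used only to sanity-check the closed form."""
    grad = []
    for i in range(len(student_logits)):
        up = list(student_logits)
        down = list(student_logits)
        up[i] += eps
        down[i] -= eps
        diff = reverse_kl(up, teacher_probs) - reverse_kl(down, teacher_probs)
        grad.append(diff / (2 * eps))
    return grad


# Hypothetical 4-token vocabulary; the values are arbitrary.
student_logits = [0.5, -1.0, 2.0, 0.0]
teacher_probs = softmax([1.0, 0.0, 1.5, -0.5])

analytic = reverse_kl_logit_grad(student_logits, teacher_probs)
numeric = finite_difference_grad(student_logits, teacher_probs)
max_abs_error = max(abs(a - n) for a, n in zip(analytic, numeric))
```

Because the expression uses only the student and teacher log-probabilities at one position, it can be evaluated for every position in a batch without sampling, which is consistent with the abstract's remark about efficient GPU implementation. The components of this gradient also sum to zero, as expected for a function of softmax logits.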
