In this paper, we show that Simple Preference Optimization (SimPO) can be derived as a form of Maximum Entropy Reinforcement Learning, providing a theoretical foundation for this reference-free method. Motivated by SimPO's strong performance in offline preference optimization, we investigate whether Maximum Entropy RL can achieve similar results in online RLHF settings. Our experiments show that Maximum Entropy RL consistently exhibits overoptimization and unstable KL dynamics, even at very low learning rates. Unlike KL-constrained methods, which maintain stable training, entropy regularization fails to prevent reward hacking and appears to correlate with overoptimization. Lastly, we discuss possible explanations for why SimPO succeeds in offline settings while Maximum Entropy RL struggles in online ones. Our findings suggest that reference-free approaches may face distinct challenges depending on whether they are applied to online or offline preference learning.
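As a brief sketch of the connection referenced above (the notation and the exact form of SimPO's implicit reward are our assumptions, following the original SimPO formulation, not details drawn from this abstract): the Maximum Entropy RL objective is
\[
\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big] \;+\; \beta\, \mathcal{H}\big(\pi_\theta(\cdot \mid x)\big),
\]
whose optimal policy satisfies \(\pi^*(y \mid x) \propto \exp\big(r(x, y)/\beta\big)\), equivalently \(r(x, y) = \beta \log \pi^*(y \mid x) + \beta \log Z(x)\). SimPO's implicit reward, \(r_{\mathrm{SimPO}}(x, y) = \tfrac{\beta}{|y|} \log \pi_\theta(y \mid x)\), is the length-normalized log-likelihood of the policy itself, with no reference model; up to the partition term and the length normalization, this matches the reward implied by a maximum-entropy optimal policy, which is the correspondence the paper develops.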