Reinforcement learning from human feedback (RLHF) has emerged as a key technique for aligning the output of large language models (LLMs) with human preferences. To learn the reward function, most existing RLHF algorithms use the Bradley-Terry model, which relies on assumptions about human preferences that may not reflect the complexity and variability of real-world judgments. In this paper, we propose a robust algorithm to enhance the performance of existing approaches under such reward model misspecification. Theoretically, our algorithm reduces the variance of reward and policy estimators, leading to improved regret bounds. Empirical evaluations on LLM benchmark datasets demonstrate that the proposed algorithm consistently outperforms existing methods, with 77-81% of responses being favored over baselines on the Anthropic Helpful and Harmless dataset. The code is available at https://github.com/VRPO/VRPO.
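For reference, the Bradley-Terry model mentioned above is the standard preference model used for reward learning in RLHF; a common formulation (generic notation, not necessarily this paper's) is

\[
\Pr(y_w \succ y_l \mid x) \;=\; \sigma\bigl(r_\theta(x, y_w) - r_\theta(x, y_l)\bigr) \;=\; \frac{\exp r_\theta(x, y_w)}{\exp r_\theta(x, y_w) + \exp r_\theta(x, y_l)},
\]

where $x$ is the prompt, $y_w$ and $y_l$ are the preferred and dispreferred responses, $r_\theta$ is the learned reward, and $\sigma$ is the logistic function. The misspecification concern is that observed human judgments need not follow this parametric form.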