Balancing Value Underestimation and Overestimation with Realistic Actor-Critic - Zhuanzhi Papers

Underestimation · Overestimation · Reducible · Value function approximation · Learned

November 10, 2021

Balancing Value Underestimation and Overestimation with Realistic Actor-Critic

Sicen Li, Gang Wang, Qinyun Tang, Liquan Wang

From arXiv. Added references; corrected typos.

Model-free deep reinforcement learning (RL) has been successfully applied to challenging continuous control domains. However, poor sample efficiency prevents these methods from being widely used in real-world domains. We address this problem by proposing a novel model-free algorithm, Realistic Actor-Critic (RAC), which aims to resolve the trade-off between value underestimation and overestimation by learning a family of policies with respect to various confidence bounds of the Q-function. We construct uncertainty punished Q-learning (UPQ), which uses the uncertainty from an ensemble of multiple critics to control the estimation bias of the Q-function, making Q-functions shift smoothly from lower- to higher-confidence bounds. Guided by these critics, RAC employs Universal Value Function Approximators (UVFA) to simultaneously learn many optimistic and pessimistic policies with the same neural network. Optimistic policies generate effective exploratory behaviors, while pessimistic policies reduce the risk of value overestimation to ensure stable updates of policies and Q-functions. The proposed method can be incorporated into any off-policy actor-critic RL algorithm. Our method achieves 10x sample efficiency and a 25% performance improvement compared to SAC on the most challenging Humanoid environment, obtaining an episode reward of $11107\pm 475$ at $10^6$ time steps. All the source code is available at https://github.com/ihuhuhu/RAC.
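To make the idea of an uncertainty-penalized target concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the UPQ formula, so the sketch assumes one common form, the ensemble mean minus `beta` times the ensemble standard deviation. The function names (`penalized_q`, `td_target`, `policy_input`), the parameter `beta`, and the choice to append `beta` to the state as a UVFA-style input are all hypothetical illustrations.

```python
import numpy as np


def penalized_q(q_values, beta):
    """Confidence bound over an ensemble of critics (assumed form).

    q_values: array of shape (n_critics, batch) with one row per critic.
    beta:     scalar; beta > 0 gives a pessimistic (lower) bound,
              beta < 0 an optimistic (upper) bound, beta = 0 the mean.
    """
    q_values = np.asarray(q_values, dtype=np.float64)
    mean = q_values.mean(axis=0)
    std = q_values.std(axis=0)  # population std across critics
    return mean - beta * std


def td_target(reward, done, next_q_values, beta, gamma=0.99):
    """One-step TD target that bootstraps from the penalized bound."""
    bound = penalized_q(next_q_values, beta)
    return reward + gamma * (1.0 - done) * bound


def policy_input(state, beta):
    """UVFA-style conditioning: append beta to a single state vector."""
    return np.concatenate([np.asarray(state, dtype=np.float64),
                           np.atleast_1d(float(beta))])


if __name__ == "__main__":
    # Two critics, batch of two transitions.
    next_q = np.array([[1.0, 2.0],
                       [3.0, 2.0]])
    for beta in (-1.0, 0.0, 1.0):
        print(beta, penalized_q(next_q, beta))
```

Sweeping `beta` from positive to negative moves the bootstrapped value from a lower to an upper confidence bound, which is the smooth shift the abstract describes. Feeding the same `beta` to the policy lets one network represent both pessimistic and optimistic behavior.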
