It is vital to accurately estimate the value function in Deep Reinforcement Learning (DRL) so that the agent executes proper actions rather than suboptimal ones. However, existing actor-critic methods suffer to varying degrees from underestimation or overestimation bias, which negatively affects their performance. In this paper, we reveal a simple but effective principle: proper value correction benefits bias alleviation. To this end, we propose the generalized-activated weighting operator, which uses any non-decreasing function, termed the activation function, as a weight for better value estimation. In particular, we integrate the generalized-activated weighting operator into value estimation and introduce a novel algorithm, Generalized-activated Deep Double Deterministic Policy Gradients (GD3). We theoretically show that GD3 is capable of alleviating the potential estimation bias. Interestingly, we find that simple activation functions yield satisfactory performance without additional tricks and can contribute to faster convergence. Experimental results on numerous challenging continuous control tasks show that GD3 with a task-specific activation outperforms common baseline methods. We also find that fine-tuning the polynomial activation function achieves superior results on most tasks.
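The following is a minimal sketch of what such a generalized-activated weighting could look like, assuming the operator aggregates Q-value samples with weights given by a non-decreasing activation g (as the abstract describes); the function name, the exponential and polynomial activation choices, and the sampled Q values are illustrative, not the paper's exact formulation.

```python
import numpy as np

def generalized_activated_value(q_values, activation=np.exp):
    """Aggregate Q-value estimates weighted by a non-decreasing activation g.

    With g = exp this reduces to softmax-style weighting; a constant g gives
    the plain mean; a steeper g pushes the estimate toward max Q. The
    activation should be non-decreasing (and non-negative) over the range
    of the Q values so the weights are valid.
    """
    weights = activation(q_values)
    return np.sum(weights * q_values) / np.sum(weights)

# Hypothetical Q estimates at a few sampled actions for one state.
q_samples = np.array([1.0, 2.0, 3.5])
print(generalized_activated_value(q_samples))                    # softmax-style weighting
print(generalized_activated_value(q_samples, lambda q: q ** 2))  # polynomial activation (valid here since Q > 0)
```

Varying the activation interpolates between mean-like (underestimation-prone) and max-like (overestimation-prone) estimates, which is how a task-specific choice can trade off the two biases.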