Bayesian Distributional Policy Gradients - Zhuanzhi Papers



March 23, 2021

Bayesian Distributional Policy Gradients


Luchen Li, A. Aldo Faisal

Distributional Reinforcement Learning (RL) maintains the entire probability distribution of the reward-to-go, i.e. the return, providing more learning signals that account for the uncertainty associated with policy performance, which may be beneficial for trading off exploration and exploitation and for policy learning in general. Previous works in distributional RL focused mainly on computing the state-action-return distributions; here we model the state-return distributions. This enables us to translate successful conventional RL algorithms that are based on state values into distributional RL. We formulate the distributional Bellman operation as an inference-based auto-encoding process that minimises Wasserstein metrics between target and model return distributions. The proposed algorithm, BDPG (Bayesian Distributional Policy Gradients), uses adversarial training in joint-contrastive learning to estimate a variational posterior from the returns. Moreover, we can now interpret the return prediction uncertainty as an information gain, which allows us to obtain a new curiosity measure that helps BDPG steer exploration actively and efficiently. We demonstrate in a suite of Atari 2600 games and MuJoCo tasks, including well-known hard-exploration challenges, how BDPG learns generally faster and with higher asymptotic performance than reference distributional RL algorithms.
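
The two ingredients named in the abstract, a distributional Bellman target for state-return distributions and a Wasserstein metric between target and model returns, can be illustrated with a minimal sketch. This is not the authors' BDPG implementation: it assumes each return distribution is represented by an equal-size set of scalar samples, it uses the closed-form one-dimensional 1-Wasserstein distance in place of the adversarial, inference-based training the abstract describes, and the function names `bellman_target_samples` and `wasserstein_1` are hypothetical.

```python
import random


def bellman_target_samples(reward, next_return_samples, gamma=0.99):
    """Sample-based target G(s) = r + gamma * G(s') for a state-return distribution.

    Assumption: the return at the next state is given as a list of samples.
    """
    return [reward + gamma * g for g in next_return_samples]


def wasserstein_1(samples_a, samples_b):
    """1-Wasserstein distance between two equal-size empirical distributions
    on the real line: the mean absolute difference of the sorted samples."""
    if len(samples_a) != len(samples_b) or not samples_a:
        raise ValueError("sample sets must be non-empty and of equal size")
    a, b = sorted(samples_a), sorted(samples_b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical samples of the return at the next state s'.
    next_returns = [rng.gauss(1.0, 0.5) for _ in range(1000)]
    target = bellman_target_samples(0.1, next_returns, gamma=0.9)
    # Hypothetical samples from the current model of the return at s.
    model = [rng.gauss(0.0, 1.0) for _ in range(1000)]
    # A learner would adjust the model to shrink this distance.
    print(wasserstein_1(model, target))
```

Because the sketch works on the return at a state rather than at a state-action pair, the target depends only on the observed reward and samples of the next-state return, which is the state-return setting the abstract contrasts with earlier state-action-return methods.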

