The policy gradient theorem states that the policy should only be updated in states that are visited by the current policy, which leads to insufficient planning in the off-policy states, and thus to convergence to suboptimal policies. We tackle this planning issue by extending the policy gradient theory to policy updates with respect to any state density. Under these generalized policy updates, we show convergence to optimality under a necessary and sufficient condition on the updates' state densities, and thereby solve the aforementioned planning issue. We also prove asymptotic convergence rates that significantly improve on those in the policy gradient literature. To implement the principles prescribed by our theory, we propose an agent, Dr Jekyll & Mr Hyde (JH), with a double personality: Dr Jekyll purely exploits while Mr Hyde purely explores. JH's independent policies allow us to record two separate replay buffers: one on-policy (Dr Jekyll's) and one off-policy (Mr Hyde's), and therefore to update JH's models with a mixture of on-policy and off-policy updates. More than an algorithm, JH defines principles for actor-critic algorithms to satisfy the requirements we identify in our analysis. We extensively test JH on finite MDPs, where it demonstrates a superior ability to recover from convergence to a suboptimal policy without impairing its speed of convergence. We also implement a deep version of the algorithm and test it on a simple problem, where it shows promising results.
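To make the two-buffer mechanism described above concrete, the following minimal Python sketch illustrates how separate on-policy (Dr Jekyll) and off-policy (Mr Hyde) replay buffers could feed a mixed update batch. The class and names (`ReplayBuffer`, `jekyll_buffer`, `hyde_buffer`, `on_policy_ratio`) are hypothetical illustrations, not the paper's implementation.

```python
import random

class ReplayBuffer:
    """Simple FIFO replay buffer (hypothetical helper, not from the paper)."""
    def __init__(self, capacity=10000):
        self.capacity = capacity
        self.storage = []

    def add(self, transition):
        # Drop the oldest transition once capacity is reached.
        if len(self.storage) >= self.capacity:
            self.storage.pop(0)
        self.storage.append(transition)

    def sample(self, batch_size):
        return random.sample(self.storage, min(batch_size, len(self.storage)))

# Dr Jekyll records on-policy (exploitation) transitions,
# Mr Hyde records off-policy (pure exploration) transitions.
jekyll_buffer = ReplayBuffer()
hyde_buffer = ReplayBuffer()

def sample_mixed_batch(batch_size, on_policy_ratio=0.5):
    """Mix on-policy and off-policy samples; the ratio is an assumed hyperparameter."""
    n_on = int(batch_size * on_policy_ratio)
    n_off = batch_size - n_on
    return jekyll_buffer.sample(n_on) + hyde_buffer.sample(n_off)
```

Under these assumptions, the actor-critic models would then be updated on batches returned by `sample_mixed_batch`, realizing the mixture of on-policy and off-policy updates the abstract refers to.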