One of the major difficulties of reinforcement learning is learning from {\em off-policy} samples, which are collected by a policy (the behavior policy) different from the one the algorithm evaluates (the target policy). Off-policy learning needs to correct the distribution of samples from the behavior policy towards that of the target policy. Unfortunately, importance sampling has an inherently high variance, which leads to poor gradient estimates in policy gradient methods. We focus on an off-policy Actor-Critic architecture and propose a novel method, called Preconditioned Proximal Policy Optimization (P3O), which controls the high variance of importance sampling by applying a preconditioner to the Conservative Policy Iteration (CPI) objective. {\em This preconditioning uses the sigmoid function in a special way: when there is no policy change, the gradient is maximal, and hence the policy gradient drives a large parameter update for an efficient exploration of the parameter space.} This is a novel exploration method that has not been studied before, given that existing exploration methods are based on the novelty of states and actions. We compare P3O with several best-performing algorithms on both discrete and continuous tasks, and the results confirm that {\em P3O is more off-policy than PPO} according to the ``off-policyness'' measured by the DEON metric, and that P3O explores a larger policy space than PPO. Results also show that P3O maximizes the CPI objective better than PPO during training.
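To make the preceding description concrete, recall the standard CPI surrogate built on the importance-sampling ratio. The sigmoid-preconditioned form shown after it is only an illustrative sketch consistent with the stated property (gradient largest when the policy has not changed); it is an assumption for exposition, not necessarily the exact P3O objective.
\[
L^{\mathrm{CPI}}(\theta) \;=\; \hat{\mathbb{E}}_t\!\left[\, r_t(\theta)\,\hat{A}_t \,\right],
\qquad
r_t(\theta) \;=\; \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\mathrm{behavior}}(a_t \mid s_t)},
\]
and one illustrative preconditioned surrogate with the stated property is
\[
L^{\mathrm{sketch}}(\theta) \;=\; \hat{\mathbb{E}}_t\!\left[\, \sigma\!\big(\log r_t(\theta)\big)\,\hat{A}_t \,\right],
\qquad
\sigma(x) \;=\; \frac{1}{1+e^{-x}},
\]
whose per-sample gradient carries the factor $\sigma'\!\big(\log r_t(\theta)\big) = \sigma(\log r_t)\big(1-\sigma(\log r_t)\big)$. This factor is largest exactly when $r_t(\theta)=1$, i.e., when the target policy has not yet moved away from the behavior policy, and it shrinks as $r_t(\theta)$ drifts far from $1$, damping the high-variance importance-sampling terms.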