The Sufficiency of Off-policyness and Soft Clipping: PPO is insufficient according to an Off-policy Measure


August 8, 2022


Xing Chen, Dongcui Diao, Hechang Chen, Hengshuai Yao, Jielong Yang, Haiyin Piao, Zhixiao Sun, Bei Jiang, Yi Chang

Many policy gradient methods optimize the objective $\max_{\pi}E_{\pi}[A_{\pi_{old}}(s,a)]$, where $A_{\pi_{old}}$ is the advantage function of the old policy. This objective cannot be optimized directly because we do not yet have samples from the new policy. This is where the importance sampling (IS) ratio comes in, giving the IS-corrected objective, also called the CPI objective, $\max_{\pi}E_{\pi_{old}}[\frac{\pi(s,a)}{\pi_{old}(s,a)}A_{\pi_{old}}(s,a)]$. However, optimizing this objective is still problematic because extremely large IS ratios can cause algorithms to fail catastrophically. PPO therefore uses a surrogate objective and seeks an approximate solution in a clipped policy space, $\Pi_{\epsilon}=\{\pi : |\frac{\pi(s,a)}{\pi_{old}(s,a)}-1|<\epsilon \}$, where $\epsilon$ is a small positive number. One question that drives this paper is: how well grounded is the hypothesis that $\Pi_{\epsilon}$ contains good enough policies? Do better policies exist outside of $\Pi_{\epsilon}$? Using a novel surrogate objective that employs the sigmoid function, which results in an interesting way of exploration, we found that there do indeed exist much better policies outside of $\Pi_{\epsilon}$; in addition, these policies are located very far from it. We compare with several best-performing algorithms on both discrete and continuous tasks, and the results show that PPO is insufficient in off-policyness: according to the DEON off-policy metric, our new method P3O is more off-policy than PPO, and P3O explores in a much larger policy space than PPO.
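To make the three objectives in the abstract concrete, the following is a minimal NumPy sketch that compares them on a batch of IS ratios and advantages. The CPI and PPO-clip surrogates follow their standard definitions. The sigmoid variant is only an illustrative assumption: the abstract says the new surrogate employs the sigmoid function but does not give its formula, so the function `sigmoid_soft_surrogate`, its temperature `tau`, and the `4 / tau` scaling (chosen here so the slope at ratio 1 equals the CPI slope) are hypothetical and should not be read as the authors' exact P3O objective.

```python
import numpy as np


def cpi_surrogate(ratio, adv):
    """IS-corrected (CPI) surrogate: mean of ratio * advantage."""
    return np.mean(ratio * adv)


def ppo_clip_surrogate(ratio, adv, eps=0.2):
    """Standard PPO clipped surrogate: mean of min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * adv, clipped * adv))


def sigmoid_soft_surrogate(ratio, adv, tau=4.0):
    """Hypothetical soft clipping (an assumption, not taken from the abstract).

    Squashes (ratio - 1) with a sigmoid. The 4 / tau factor makes the
    derivative with respect to the ratio equal to 1 at ratio == 1.
    """
    squashed = 1.0 / (1.0 + np.exp(-tau * (ratio - 1.0)))
    return np.mean((4.0 / tau) * squashed * adv)


def finite_diff(objective, ratio, adv, h=1e-5):
    """Central finite-difference slope of an objective with respect to the ratio."""
    return (objective(ratio + h, adv) - objective(ratio - h, adv)) / (2.0 * h)


if __name__ == "__main__":
    # A single sample whose ratio already lies outside the clip range [0.8, 1.2].
    ratio = np.array([1.5])
    adv = np.array([1.0])
    for name, fn in [
        ("CPI", cpi_surrogate),
        ("PPO clip", ppo_clip_surrogate),
        ("sigmoid soft clip (hypothetical)", sigmoid_soft_surrogate),
    ]:
        print(name, "value:", fn(ratio, adv), "slope:", finite_diff(fn, ratio, adv))
```

In this sketch the PPO-clip slope is zero once the ratio leaves the clip range with a positive advantage, so that sample no longer pushes the policy further from the old one. The sigmoid form stays bounded for very large ratios yet keeps a non-zero slope outside the range, which is one way a soft clip could let the policy move beyond $\Pi_{\epsilon}$.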

