In this paper, we establish the global optimality and convergence rate of an off-policy actor-critic algorithm in the tabular setting without using a density ratio to correct the discrepancy between the state distribution of the behavior policy and that of the target policy. Our work goes beyond existing works on the optimality of policy gradient methods in that existing works use the exact policy gradient for updating the policy parameters while we use an approximate and stochastic update step. Our update step is not a gradient update because we do not use a density ratio to correct the state distribution, which aligns well with what practitioners do. Our update is approximate because we use a learned critic instead of the true value function. Our update is stochastic because at each step the update is done for only the current state-action pair. Moreover, we remove several restrictive assumptions from existing works in our analysis. Central to our work is the finite-sample analysis of a generic stochastic approximation algorithm with time-inhomogeneous update operators on time-inhomogeneous Markov chains, based on its uniform contraction properties.
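To make the algorithmic setup concrete, the following is a minimal sketch of a tabular off-policy actor-critic update of the kind described above: the critic is learned rather than exact, each update touches only the current state-action pair, and no density ratio is used to reweight the state distribution. The environment interface (`env`, `behavior_policy`), the step sizes, and the per-action importance ratio are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def softmax(prefs):
    """Softmax over action preferences for one state."""
    z = prefs - prefs.max()
    p = np.exp(z)
    return p / p.sum()

def off_policy_actor_critic(env, behavior_policy, num_steps=100_000,
                            gamma=0.99, alpha_q=0.1, alpha_theta=0.01):
    """Sketch of a tabular off-policy actor-critic without a state density ratio.

    Samples (s, a, r, s') under the behavior policy and, at each step,
    updates only the visited (s, a) entry:
      - the critic Q via an off-policy TD(0) target under the target policy,
      - the actor preferences theta via the learned critic in place of the
        true value function; the per-action ratio pi(a|s)/mu(a|s) reweights
        the action distribution only, so the state distribution mismatch
        is deliberately left uncorrected.
    (Hypothetical interface: env.num_states, env.num_actions, env.reset(), env.step().)
    """
    nS, nA = env.num_states, env.num_actions
    Q = np.zeros((nS, nA))        # learned critic (approximate)
    theta = np.zeros((nS, nA))    # softmax policy preferences

    s = env.reset()
    for _ in range(num_steps):
        mu = behavior_policy[s]               # behavior action probabilities at s
        a = np.random.choice(nA, p=mu)
        s_next, r, done = env.step(a)

        pi_next = softmax(theta[s_next])
        # Off-policy TD(0) critic target: expectation under the target policy.
        td_target = r + (0.0 if done else gamma * pi_next @ Q[s_next])
        Q[s, a] += alpha_q * (td_target - Q[s, a])

        pi = softmax(theta[s])
        rho = pi[a] / mu[a]                   # corrects the action, not the state, distribution
        # Stochastic actor update at the current (s, a) only,
        # using the learned critic Q instead of the true value function.
        grad_log_pi = -pi
        grad_log_pi[a] += 1.0
        theta[s] += alpha_theta * rho * Q[s, a] * grad_log_pi

        s = env.reset() if done else s_next
    return theta, Q
```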