To accumulate knowledge and improve its behaviour policy, a reinforcement learning agent can learn `off-policy' about policies that differ from the policy used to generate its experience. This is important for learning counterfactuals, or when the experience was generated outside of the agent's control. However, off-policy learning is non-trivial, and standard reinforcement-learning algorithms can be unstable and divergent. In this paper we discuss a novel family of off-policy prediction algorithms that are convergent by construction. The idea is to first learn on-policy about the data-generating behaviour, and then bootstrap an off-policy value estimate on this on-policy estimate, thereby constructing a value estimate that is partially off-policy. This process can be repeated to build a chain of value functions, each time bootstrapping a new estimate on the previous estimate in the chain. Each step in the chain is stable, and hence the complete algorithm is guaranteed to be stable. Under mild conditions the result comes arbitrarily close to the off-policy TD solution as the length of the chain increases. Hence it can compute the solution even in cases where off-policy TD diverges. We prove that the proposed scheme is convergent and corresponds to an iterative decomposition of the inverse key matrix. Furthermore, it can be interpreted as estimating a novel objective -- which we call a `k-step expedition' -- of following the target policy for finitely many steps before continuing indefinitely with the behaviour policy. Empirically, we evaluate the idea on challenging MDPs such as Baird's counterexample and observe favourable results.
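As a sketch of the chaining idea, in standard TD notation that we assume here (target policy $\pi$, behaviour policy $\mu$, discount $\gamma$; not the paper's own formalisation), each link of the chain bootstraps one step of the target policy on the previous link's estimate:
\begin{align*}
v_0(s) &\approx v_\mu(s), \\
v_{k+1}(s) &\approx \mathbb{E}_{\pi}\big[\, R_{t+1} + \gamma\, v_k(S_{t+1}) \mid S_t = s \,\big].
\end{align*}
Under this reading, $v_k$ estimates the value of a `k-step expedition': follow $\pi$ for $k$ steps and then continue with $\mu$ indefinitely. Each link is a stable one-step prediction problem, and increasing $k$ moves the estimate towards the off-policy TD solution.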