Q-Learning for MDPs with General Spaces: Convergence and Near Optimality via Quantization under Weak Continuity

Reinforcement learning algorithms often require the state and action spaces of a Markov decision process (MDP) to be finite, and various efforts have been made in the literature to extend such algorithms to continuous state and action spaces. In this paper, we show that under very mild regularity conditions (in particular, involving only weak continuity of the transition kernel of an MDP), Q-learning for standard Borel MDPs via quantization of states and actions converges to a limit. Furthermore, this limit satisfies an optimality equation that leads to near optimality, either with explicit performance bounds or with guaranteed asymptotic optimality. Our approach builds on (i) viewing quantization as a measurement kernel and thus a quantized MDP as a POMDP, (ii) utilizing near-optimality and convergence results of Q-learning for POMDPs, and (iii) near optimality of finite-state model approximations for MDPs with weakly continuous kernels, which we show to correspond to the fixed point of the constructed POMDP. Thus, our paper presents a very general convergence and approximation result for the applicability of Q-learning to continuous MDPs.
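To make the idea of Q-learning via quantization concrete, the following is a minimal sketch rather than the construction analyzed in the paper. It assumes a hypothetical one-dimensional toy problem: the state space [0, 1], the dynamics and cost in `step`, the uniform grid in `quantize`, the three-point action grid, the uniformly random exploration policy, and the 1/(visit count) step size are all illustrative choices that do not come from the abstract. The learner only ever sees the bin index of the state, which is the sense in which the quantizer acts like a measurement channel.

```python
import numpy as np


def quantize(x, n_bins, low=0.0, high=1.0):
    """Map a continuous value in [low, high] to a bin index in {0, ..., n_bins - 1}."""
    idx = int((x - low) / (high - low) * n_bins)
    return min(max(idx, 0), n_bins - 1)


def step(x, u, rng):
    """Hypothetical toy dynamics and cost (assumption, not from the paper).

    The state drifts by 0.2 * u plus Gaussian noise and is clipped to [0, 1];
    the stage cost is the squared distance of the current state from 0.5.
    """
    x_next = float(np.clip(x + 0.2 * u + 0.05 * rng.standard_normal(), 0.0, 1.0))
    cost = (x - 0.5) ** 2
    return x_next, cost


def quantized_q_learning(n_state_bins=10, action_grid=(-1.0, 0.0, 1.0),
                         beta=0.9, n_steps=50_000, seed=0):
    """Tabular Q-learning (cost minimization) run on quantized states and actions."""
    rng = np.random.default_rng(seed)
    q = np.zeros((n_state_bins, len(action_grid)))
    visits = np.zeros_like(q)
    x = rng.uniform()  # continuous initial state
    for _ in range(n_steps):
        s = quantize(x, n_state_bins)          # learner observes only the bin
        a = int(rng.integers(len(action_grid)))  # uniform exploration over the action grid
        x_next, cost = step(x, action_grid[a], rng)
        s_next = quantize(x_next, n_state_bins)
        visits[s, a] += 1
        alpha = 1.0 / visits[s, a]             # decaying step size per (bin, action) pair
        target = cost + beta * q[s_next].min()
        q[s, a] += alpha * (target - q[s, a])
        x = x_next                             # the true continuous state keeps evolving
    return q


if __name__ == "__main__":
    q_table = quantized_q_learning()
    greedy = q_table.argmin(axis=1)            # greedy action index for each state bin
    print(np.round(q_table, 3))
    print(greedy)
```

Refining the grids (larger `n_state_bins`, a denser `action_grid`) is the knob that corresponds to finer quantization; how the resulting policy's performance behaves as the grid is refined is what the paper's bounds address, and this sketch makes no claim about it.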

