Modern tasks in reinforcement learning have large state and action spaces. To deal with them efficiently, one often uses a predefined feature mapping to represent states and actions in a low-dimensional space. In this paper, we study reinforcement learning for discounted Markov Decision Processes (MDPs), where the transition kernel can be parameterized as a linear function of a certain feature mapping. We propose a novel algorithm that makes use of the feature mapping and obtains a $\tilde O(d\sqrt{T}/(1-\gamma)^2)$ regret, where $d$ is the dimension of the feature space, $T$ is the time horizon, and $\gamma$ is the discount factor of the MDP. To the best of our knowledge, this is the first polynomial regret bound without accessing the generative model or making strong assumptions such as ergodicity of the MDP. By constructing a special class of MDPs, we also show that for any algorithm, the regret is lower bounded by $\Omega(d\sqrt{T}/(1-\gamma)^{1.5})$. Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a $(1-\gamma)^{-0.5}$ factor.
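To make the stated near-optimality factor concrete, dividing the upper bound by the lower bound (and ignoring logarithmic factors) isolates the remaining gap:
\[
\frac{\tilde O\!\left(d\sqrt{T}/(1-\gamma)^{2}\right)}{\Omega\!\left(d\sqrt{T}/(1-\gamma)^{1.5}\right)}
= \tilde O\!\left((1-\gamma)^{-0.5}\right),
\]
which is exactly the $(1-\gamma)^{-0.5}$ factor by which the proposed algorithm may fall short of the information-theoretic limit.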