We study reinforcement learning with linear function approximation, where the transition probability and reward functions are linear with respect to a feature mapping $\boldsymbol{\phi}(s,a)$. Specifically, we consider the episodic inhomogeneous linear Markov Decision Process (MDP) and propose a novel computationally efficient algorithm, LSVI-UCB$^+$, which achieves an $\widetilde{O}(Hd\sqrt{T})$ regret bound, where $H$ is the episode length, $d$ is the feature dimension, and $T$ is the number of steps. LSVI-UCB$^+$ builds on weighted ridge regression and upper confidence value iteration with a Bernstein-type exploration bonus. Our statistical results are obtained with novel analytical tools, including a new Bernstein self-normalized bound with conservatism on elliptical potentials, and a refined analysis of the correction term. To the best of our knowledge, this is the first minimax optimal algorithm for linear MDPs up to logarithmic factors, which closes the $\sqrt{Hd}$ gap between the best known upper bound of $\widetilde{O}(\sqrt{H^3d^3T})$ in \cite{jin2020provably} and the lower bound of $\Omega(Hd\sqrt{T})$ for linear MDPs.
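For concreteness, the weighted ridge regression estimate and Bernstein-type bonus underlying LSVI-UCB$^+$ can be sketched as follows; the notation ($\bar\sigma_{i,h}$, $\widehat{\beta}_k$, $\lambda$, $V_{k,h+1}$) is illustrative and may differ from the exact definitions in the main text.
% Sketch of weighted ridge regression with a Bernstein-type exploration bonus
% (illustrative notation; \bar\sigma_{i,h}^2 denotes a variance estimate used as a regression weight,
%  \widehat{\beta}_k a confidence radius, and \lambda a regularization parameter).
\begin{align*}
\boldsymbol{\Sigma}_{k,h} &= \lambda \mathbf{I}
  + \sum_{i=1}^{k-1} \bar\sigma_{i,h}^{-2}\,
    \boldsymbol{\phi}(s_h^i, a_h^i)\,\boldsymbol{\phi}(s_h^i, a_h^i)^\top, \\
\widehat{\boldsymbol{w}}_{k,h} &= \boldsymbol{\Sigma}_{k,h}^{-1}
  \sum_{i=1}^{k-1} \bar\sigma_{i,h}^{-2}\,
    \boldsymbol{\phi}(s_h^i, a_h^i)
    \bigl[r_h^i + V_{k,h+1}(s_{h+1}^i)\bigr], \\
Q_{k,h}(s,a) &= \min\Bigl\{
  \boldsymbol{\phi}(s,a)^\top \widehat{\boldsymbol{w}}_{k,h}
  + \widehat{\beta}_k\,
    \bigl\|\boldsymbol{\Sigma}_{k,h}^{-1/2}\boldsymbol{\phi}(s,a)\bigr\|_2,\; H\Bigr\}.
\end{align*}
Here the variance-weighted covariance matrix $\boldsymbol{\Sigma}_{k,h}$ and estimator $\widehat{\boldsymbol{w}}_{k,h}$ play the role of the weighted ridge regression step, while the term $\widehat{\beta}_k\,\|\boldsymbol{\Sigma}_{k,h}^{-1/2}\boldsymbol{\phi}(s,a)\|_2$ serves as the Bernstein-type exploration bonus in the upper confidence value iteration.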