We study reinforcement learning in an infinite-horizon average-reward setting with linear function approximation, where the transition probability function of the underlying Markov Decision Process (MDP) admits a linear form over a feature mapping of the current state, action, and next state. We propose a new algorithm, UCRL2-VTR, which can be seen as an extension of the UCRL2 algorithm with linear function approximation. We show that UCRL2-VTR with a Bernstein-type bonus achieves a regret of $\tilde{O}(d\sqrt{DT})$, where $d$ is the dimension of the feature mapping, $T$ is the time horizon, and $D$ is the diameter of the MDP. We also prove a matching lower bound of $\tilde{\Omega}(d\sqrt{DT})$, which shows that the proposed UCRL2-VTR is minimax optimal up to logarithmic factors. To the best of our knowledge, our algorithm is the first nearly minimax optimal RL algorithm with function approximation in the infinite-horizon average-reward setting.
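For concreteness, a minimal sketch of the assumed transition structure and of the regret being bounded, using notation ($\phi$, $\theta^*$, $J^*$) that the abstract itself does not fix:
\[
  \mathbb{P}(s' \mid s, a) \;=\; \langle \phi(s, a, s'), \theta^* \rangle, \qquad \theta^* \in \mathbb{R}^d,
\]
so the unknown transition kernel is linear in a known $d$-dimensional feature mapping of the triple $(s, a, s')$, and the regret over $T$ steps compares the rewards collected by the algorithm to the optimal long-run average reward $J^*$:
\[
  \mathrm{Regret}(T) \;=\; \sum_{t=1}^{T} \bigl( J^* - r(s_t, a_t) \bigr).
\]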