We study episodic reinforcement learning in non-stationary linear (a.k.a. low-rank) Markov Decision Processes (MDPs), i.e., both the reward and the transition kernel are linear with respect to a given feature map and are allowed to evolve either slowly or abruptly over time. For this problem setting, we propose OPT-WLSVI, an optimistic model-free algorithm based on weighted least squares value iteration which uses exponential weights to smoothly forget data that lie far in the past. We show that our algorithm, when competing against the best policy at each time, achieves a regret that is upper bounded by $\widetilde{\mathcal{O}}(d^{5/4}H^2 \Delta^{1/4} K^{3/4})$, where $d$ is the dimension of the feature space, $H$ is the planning horizon, $K$ is the number of episodes and $\Delta$ is a suitable measure of the non-stationarity of the MDP. Moreover, we point out technical gaps in previous studies of forgetting strategies in the non-stationary linear bandit setting, and we propose a fix to their regret analysis.
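To illustrate the forgetting mechanism, the following display is a minimal sketch of an exponentially weighted regularized least-squares estimator of the kind underlying weighted least squares value iteration; the notation here is assumed for illustration (discount weight $\eta \in (0,1)$, regularizer $\lambda > 0$, feature map $\phi$, and regression targets $y_h^\tau$) and the paper's exact estimator may differ:
\[
  \hat{w}_h^k \;=\; \Big(\sum_{\tau=1}^{k-1} \eta^{\,k-1-\tau}\, \phi(s_h^\tau, a_h^\tau)\, \phi(s_h^\tau, a_h^\tau)^\top \;+\; \lambda I_d\Big)^{-1} \sum_{\tau=1}^{k-1} \eta^{\,k-1-\tau}\, \phi(s_h^\tau, a_h^\tau)\, y_h^\tau ,
\]
so that observations from episodes $\tau$ far in the past receive exponentially small weight $\eta^{\,k-1-\tau}$ and are smoothly forgotten, which is what allows the estimator to track a slowly or abruptly changing MDP.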