We study reinforcement learning for finite-horizon episodic Markov decision processes with adversarial rewards and full-information feedback, where the unknown transition probability function is a linear function of a given feature mapping. We propose an optimistic policy optimization algorithm with a Bernstein bonus and show that it achieves an $\tilde{O}(dH\sqrt{T})$ regret, where $H$ is the length of each episode, $T$ is the number of interactions with the MDP, and $d$ is the dimension of the feature mapping. Furthermore, we prove a lower bound of $\tilde{\Omega}(dH\sqrt{T})$, which matches the upper bound up to logarithmic factors. To the best of our knowledge, this is the first computationally efficient, nearly minimax optimal algorithm for adversarial Markov decision processes with linear function approximation.
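To make the setup concrete, one standard way to formalize the linearity assumption and the regret notion described above is sketched below; the symbols $\phi$, $\theta_h^*$, $\pi_k$, $\pi^*$, and $K$ are illustrative notation introduced here (a linear mixture-style model), not necessarily the exact formulation used in the paper:
\[
\mathbb{P}_h(s' \mid s, a) = \big\langle \phi(s' \mid s, a),\, \theta_h^* \big\rangle, \qquad \theta_h^* \in \mathbb{R}^d,
\]
\[
\mathrm{Regret}(K) = \max_{\pi^*} \sum_{k=1}^{K} \Big( V_1^{\pi^*,\,k}(s_1) - V_1^{\pi_k,\,k}(s_1) \Big) = \tilde{O}\big(dH\sqrt{T}\big), \qquad T = KH,
\]
where $k$ indexes episodes, the reward functions may change adversarially across episodes, and $V_1^{\pi,\,k}$ denotes the value of policy $\pi$ under the episode-$k$ rewards.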