We study learning in a periodic Markov Decision Process (MDP), a special class of non-stationary MDP in which both the state transition probabilities and the reward function vary periodically, under the average-reward maximization setting. We formulate the problem as a stationary MDP by augmenting the state space with the period index, and propose a Periodic Upper Confidence bound Reinforcement Learning-2 (PUCRL2) algorithm. We show that the regret of PUCRL2 varies linearly with the period $N$ and as $\mathcal{O}(\sqrt{T\log T})$ with the horizon length $T$. Exploiting the sparsity of the transition matrix of the augmented MDP, we propose another algorithm, PUCRLB, which improves upon PUCRL2 both in regret (an $\mathcal{O}(\sqrt{N})$ dependence on the period) and in empirical performance. Finally, we propose two further algorithms, U-PUCRL2 and U-PUCRLB, for an extended uncertainty setting in which the period is unknown but a set of candidate periods is known. Numerical results demonstrate the efficacy of all the algorithms.
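The augmentation step can be illustrated with a minimal sketch (assumed notation, not the paper's code): each augmented state pairs an original state $s$ with the current phase $n$, the phase advances deterministically modulo $N$, and consequently each row of the augmented transition matrix places mass only on the next phase's block, which is the sparsity PUCRLB exploits.

```python
import numpy as np

def augment_periodic_mdp(P_seq, R_seq):
    """Fold a periodic MDP into a stationary one by appending the phase index.

    P_seq: list of N arrays, P_seq[n][s, a, s'] = transition prob at phase n
    R_seq: list of N arrays, R_seq[n][s, a]    = mean reward at phase n
    Returns stationary (P_aug, R_aug) over N*S augmented states, with the
    augmented state (s, n) flattened to index n * S + s.
    """
    N = len(P_seq)
    S, A, _ = P_seq[0].shape
    P_aug = np.zeros((N * S, A, N * S))
    R_aug = np.zeros((N * S, A))
    for n in range(N):
        nxt = (n + 1) % N  # the phase advances deterministically each step
        for s in range(S):
            R_aug[n * S + s] = R_seq[n][s]
            # transitions from phase n land only in the phase-(n+1) block,
            # so each row has at most S nonzero entries out of N*S
            P_aug[n * S + s, :, nxt * S:(nxt + 1) * S] = P_seq[n][s]
    return P_aug, R_aug
```

Running average-reward learning on `(P_aug, R_aug)` is what PUCRL2 does conceptually; PUCRLB additionally restricts its confidence sets to the block-sparse support shown above.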