We study learning in periodic Markov Decision Processes (MDPs), a special class of non-stationary MDPs in which both the state transition probabilities and the reward functions vary periodically, under the average-reward maximization setting. We formulate the problem as a stationary MDP by augmenting the state space with the period index, and propose a periodic upper confidence bound reinforcement learning-2 (PUCRL2) algorithm. We show that the regret of PUCRL2 grows linearly with the period and sub-linearly with the horizon length. Numerical results demonstrate the efficacy of PUCRL2.
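The state-augmentation step can be made concrete with a minimal sketch. The Python snippet below is illustrative only and not taken from the paper: the function name `augment_periodic_mdp` and the tensor layout (phase-indexed kernels `P` of shape `(N, S, A, S)` and mean rewards `R` of shape `(N, S, A)`) are our own assumptions. It builds a stationary MDP over the product state space of (state, phase) pairs, where the phase advances deterministically by one each step, modulo the period N.

```python
import numpy as np

def augment_periodic_mdp(P, R):
    """Convert a periodic MDP into an equivalent stationary MDP.

    P: array (N, S, A, S) -- one transition kernel per phase n.
    R: array (N, S, A)    -- one mean-reward table per phase n.
    Returns (P_aug, R_aug) over S*N augmented states, where the
    pair (s, n) is indexed as s*N + n and the phase advances
    deterministically: n -> (n + 1) mod N.
    """
    N, S, A, _ = P.shape
    P_aug = np.zeros((S * N, A, S * N))
    R_aug = np.zeros((S * N, A))
    for n in range(N):
        n_next = (n + 1) % N
        for s in range(S):
            idx = s * N + n
            R_aug[idx] = R[n, s]  # reward depends on current phase
            for a in range(A):
                for s_next in range(S):
                    # all probability mass moves to the next phase
                    P_aug[idx, a, s_next * N + n_next] = P[n, s, a, s_next]
    return P_aug, R_aug

# Small random periodic MDP as a usage check (hypothetical sizes).
N, S, A = 3, 4, 2
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(N, S, A))  # valid kernels: rows sum to 1
R = rng.uniform(size=(N, S, A))
P_aug, R_aug = augment_periodic_mdp(P, R)
assert np.allclose(P_aug.sum(axis=-1), 1.0)  # augmented kernel is stochastic
```

Under this construction, any stationary-MDP learner can in principle be run on the augmented chain; the period N multiplies the effective state space size, which is consistent with the regret's linear dependence on the period stated above.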