In this paper, we revisit the regret of undiscounted reinforcement learning in MDPs with a birth-and-death structure. Specifically, we consider a controlled queue with impatient jobs, and the main objective is to optimize a trade-off between energy consumption and user-perceived performance. Within this setting, the \emph{diameter} $D$ of the MDP is $\Omega(S^S)$, where $S$ is the number of states. Therefore, the existing lower and upper bounds on the regret at time $T$, of order $O(\sqrt{DSAT})$ for MDPs with $S$ states and $A$ actions, may suggest that reinforcement learning is inefficient here. In our main result, however, we exploit the structure of our MDPs to show that the regret of a slightly tweaked version of the classical learning algorithm {\sc Ucrl2} is in fact upper bounded by $\tilde{\mathcal{O}}(\sqrt{E_2AT})$, where $E_2$ is related to the weighted second moment of the stationary measure of a reference policy. Importantly, $E_2$ is bounded independently of $S$. Thus, our bound is asymptotically independent of the number of states and of the diameter. This result rests on a careful study of the number of visits that the learning algorithm makes to the states of the MDP, which is highly non-uniform across states.
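To make the queueing setting concrete, the following is a minimal, illustrative sketch of a uniformized controlled birth-and-death queue with impatient jobs; the class name, rates, and cost parameters below are our own assumptions chosen for illustration, not the paper's exact model. The action selects a service rate, so a faster server drains the queue (improving user-perceived performance) at a higher energy cost, while waiting jobs may abandon the queue at a per-job impatience rate.

\begin{verbatim}
import numpy as np

class BirthDeathQueue:
    """Illustrative sketch (assumed parameters, not the paper's model):
    a capacitated queue where the action picks the service rate and the
    reward trades off energy consumption against holding cost."""

    def __init__(self, S=50, lam=1.0, theta=0.2,
                 service_rates=(0.0, 0.5, 1.0, 2.0),
                 energy_cost=1.0, holding_cost=1.0, seed=0):
        self.S = S                            # highest state (queue capacity)
        self.lam = lam                        # arrival (birth) rate
        self.theta = theta                    # impatience rate per waiting job
        self.service_rates = service_rates    # action set: service speeds
        self.energy_cost = energy_cost
        self.holding_cost = holding_cost
        self.rng = np.random.default_rng(seed)
        self.state = 0

    def step(self, action):
        mu = self.service_rates[action]
        x = self.state
        # Uniformized birth/death probabilities: a birth is an arrival,
        # a death is a service completion or an abandonment.
        up = self.lam if x < self.S else 0.0
        down = (mu + x * self.theta) if x > 0 else 0.0
        # Uniformization constant bounding the total event rate in any state.
        norm = self.lam + max(self.service_rates) + self.S * self.theta
        u = self.rng.random()
        if u < up / norm:
            self.state = x + 1
        elif u < (up + down) / norm:
            self.state = x - 1
        # Reward: negative energy plus holding cost, i.e. the
        # energy / performance trade-off described above.
        reward = -(self.energy_cost * mu + self.holding_cost * self.state)
        return self.state, reward
\end{verbatim}

Under a stable reference policy, such a chain concentrates its stationary measure near small queue lengths, which is the intuition behind a bound driven by a weighted second moment of that measure rather than by $S$ or $D$.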