This paper presents the first {\em model-free}, {\em simulator-free} reinforcement learning algorithm for Constrained Markov Decision Processes (CMDPs) with sublinear regret and zero constraint violation. The algorithm is named Triple-Q because it has three key components: a Q-function (also called action-value function) for the cumulative reward, a Q-function for the cumulative utility for the constraint, and a virtual-Queue that (over)-estimates the cumulative constraint violation. Under Triple-Q, at each step, an action is chosen based on the pseudo-Q-value, which is a combination of the three Q values. The algorithm updates the reward and utility Q-values with learning rates that depend on the visit counts of the corresponding (state, action) pairs and are periodically reset. In the episodic CMDP setting, Triple-Q achieves $\tilde{\cal O}\left(\frac{1 }{\delta}H^4 S^{\frac{1}{2}}A^{\frac{1}{2}}K^{\frac{4}{5}} \right)$ regret, where $K$ is the total number of episodes, $H$ is the number of steps in each episode, $S$ is the number of states, $A$ is the number of actions, and $\delta$ is Slater's constant. Furthermore, Triple-Q guarantees zero constraint violation when $K$ is sufficiently large. Finally, the computational complexity of Triple-Q is similar to that of SARSA for unconstrained MDPs, so it is computationally efficient.
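To make the three-component structure concrete, the following is a minimal sketch of Triple-Q-style updates on a toy tabular episodic CMDP. The environment, constants (\texttt{RHO}, \texttt{ETA}, \texttt{FRAME}, the optimism bonus), and the exact virtual-queue and reset schedule are illustrative assumptions, not the paper's precise parameter choices.

\begin{verbatim}
import numpy as np

# Illustrative sketch only: toy CMDP, simplified constants and schedules.
S, A, H = 4, 2, 5          # states, actions, horizon
K = 2000                   # episodes
RHO = 0.3                  # constraint: expected cumulative utility >= RHO
ETA = np.sqrt(K)           # weight of the utility Q in the pseudo-Q-value
FRAME = int(K ** 0.8)      # episodes per frame (Q-tables reset each frame)

rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(H, S, A))   # toy transition kernel
R = rng.uniform(0, 1.0 / H, size=(H, S, A))     # toy rewards
U = rng.uniform(0, 1.0 / H, size=(H, S, A))     # toy utilities

Q_r = np.full((H, S, A), float(H))   # optimistic reward Q-function
Q_u = np.full((H, S, A), float(H))   # optimistic utility Q-function
N = np.zeros((H, S, A))              # visit counts (reset with the frame)
Z = 0.0                              # virtual queue
ep_utils = []                        # utilities collected in the current frame

for k in range(K):
    if k % FRAME == 0 and k > 0:
        # End of a frame: push the virtual queue with the average observed
        # utility, then reset Q-tables and visit counts.
        Z = max(Z + RHO - np.mean(ep_utils), 0.0)
        Q_r[:], Q_u[:], N[:] = H, H, 0
        ep_utils = []

    s, ep_util = 0, 0.0
    for h in range(H):
        # Pseudo-Q-value: reward Q plus the queue-weighted utility Q.
        a = int(np.argmax(Q_r[h, s] + (Z / ETA) * Q_u[h, s]))
        s_next = rng.choice(S, p=P[h, s, a])
        r, u = R[h, s, a], U[h, s, a]
        ep_util += u

        # Learning rate depends on the visit count of (h, s, a).
        N[h, s, a] += 1
        alpha = (H + 1) / (H + N[h, s, a])
        V_r = Q_r[h + 1, s_next].max() if h + 1 < H else 0.0
        V_u = Q_u[h + 1, s_next].max() if h + 1 < H else 0.0
        bonus = 0.1 / np.sqrt(N[h, s, a])   # illustrative optimism bonus
        Q_r[h, s, a] = (1 - alpha) * Q_r[h, s, a] + alpha * (r + V_r + bonus)
        Q_u[h, s, a] = (1 - alpha) * Q_u[h, s, a] + alpha * (u + V_u + bonus)
        s = s_next

    ep_utils.append(ep_util)

print(f"final virtual queue Z = {Z:.3f}")
\end{verbatim}

The per-step work is a single $\arg\max$ over actions plus two table updates, which is why the computational cost is comparable to SARSA-style tabular learning.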