Learning in Markov Decision Processes under Constraints

We consider reinforcement learning (RL) in Markov Decision Processes (MDPs) in which an agent repeatedly interacts with an environment that is modeled by a controlled Markov process. At each time step $t$, the agent earns a reward and also incurs a cost vector consisting of $M$ costs. We design learning algorithms that maximize the cumulative reward earned over a time horizon of $T$ time steps, while simultaneously ensuring that the average values of the $M$ cost expenditures are bounded by agent-specified thresholds $c^{ub}_i$, $i=1,2,\ldots,M$. This consideration of cumulative cost expenditures departs from the existing literature, in that the agent now additionally needs to balance its cost expenses in an online manner while simultaneously handling the exploration-exploitation trade-off typically encountered in RL tasks. To measure the performance of a reinforcement learning algorithm that satisfies the average cost constraints, we define an $(M+1)$-dimensional regret vector composed of its reward regret and $M$ cost regrets. The reward regret measures the sub-optimality in the cumulative reward, while the $i$-th component of the cost regret vector is the difference between the algorithm's $i$-th cumulative cost expense and the expected cost expenditure $Tc^{ub}_i$. We prove that, with high probability, the regret vector of UCRL-CMDP is upper-bounded as $O\left(S\sqrt{AT^{1.5}\log(T)}\right)$, where $S$ is the number of states, $A$ is the number of actions, and $T$ is the time horizon. We further show how to reduce the regret of a desired subset of the $M$ costs, at the expense of increasing the regrets of the reward and the remaining costs. To the best of our knowledge, ours is the only work that considers non-episodic RL under average cost constraints and derives algorithms that can tune the regret vector according to the agent's requirements on its cost regrets.
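The regret vector described above can be made concrete with a small sketch. This is only an illustration, not the paper's implementation: the function name `regret_vector`, its arguments, and the choice of $T$ times an optimal average reward as the reward benchmark are all assumptions. The abstract says only that the reward regret measures sub-optimality in the cumulative reward, and that the $i$-th cost regret is the $i$-th cumulative cost minus $Tc^{ub}_i$.

```python
from typing import List, Sequence


def regret_vector(
    rewards: Sequence[float],
    costs: Sequence[Sequence[float]],
    optimal_avg_reward: float,
    cost_thresholds: Sequence[float],
) -> List[float]:
    """Return the (M+1)-dimensional regret vector for one trajectory.

    rewards[t]          reward earned at step t (length T)
    costs[t][i]         i-th cost incurred at step t (T rows, M columns)
    optimal_avg_reward  ASSUMED benchmark: best achievable average reward
                        under the constraints (not specified in the abstract)
    cost_thresholds[i]  the threshold c^{ub}_i for the i-th cost
    """
    T = len(rewards)
    if len(costs) != T:
        raise ValueError("rewards and costs must cover the same T steps")

    # Reward regret: benchmark cumulative reward minus what was earned.
    reward_regret = T * optimal_avg_reward - sum(rewards)

    # i-th cost regret: cumulative i-th cost minus the budget T * c^{ub}_i.
    cost_regrets = [
        sum(step[i] for step in costs) - T * c_ub
        for i, c_ub in enumerate(cost_thresholds)
    ]
    return [reward_regret] + cost_regrets


# Hypothetical trajectory with T = 4 steps and M = 2 costs.
example = regret_vector(
    rewards=[1.0, 0.0, 1.0, 1.0],
    costs=[[0.5, 0.0], [0.0, 1.0], [0.5, 0.0], [1.0, 1.0]],
    optimal_avg_reward=1.0,
    cost_thresholds=[0.25, 0.5],
)
print(example)  # [1.0, 1.0, 0.0]
```

In this made-up trajectory the agent falls one unit short on reward, overspends the first cost budget by one unit, and exactly meets the second. A positive cost component means the budget was exceeded.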
