We consider reinforcement learning (RL) in Markov Decision Processes, in which an agent repeatedly interacts with an environment that is modeled by a controlled Markov process. At each time step $t$, the agent earns a reward and also incurs a cost vector consisting of $M$ costs. We design model-based RL algorithms that maximize the cumulative reward earned over a time horizon of $T$ time steps, while simultaneously ensuring that the average values of the $M$ cost expenditures are bounded by agent-specified thresholds $c^{ub}_i,i=1,2,\ldots,M$. In order to measure the performance of a reinforcement learning algorithm that satisfies the average cost constraints, we define an $(M+1)$-dimensional regret vector that is composed of its reward regret and $M$ cost regrets. The reward regret measures the sub-optimality in the cumulative reward, while the $i$-th component of the cost regret vector is the difference between its cumulative $i$-th cost expenditure and the budget $Tc^{ub}_i$. We prove that the expected value of the regret vector of our algorithm, UCRL-CMDP, is upper bounded as $\tilde{O}\left(T^{2/3}\right)$, where $T$ is the time horizon. We further show how to reduce the regrets of a desired subset of the $M$ costs, at the expense of increasing the regrets of the reward and the remaining costs. To the best of our knowledge, ours is the only work that considers non-episodic RL under average cost constraints and derives algorithms that can~\emph{tune the regret vector} according to the agent's requirements on its cost regrets.
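For concreteness, the following is a minimal sketch of the regret vector implied by the verbal definitions above; the symbols $\rho^\star$, $r_t$, and $c_{i,t}$ are shorthand introduced here for illustration and are not taken from the paper's own notation:
\[
  R^{(0)}(T) \;=\; T\rho^\star - \sum_{t=1}^{T} r_t,
  \qquad
  R^{(i)}(T) \;=\; \sum_{t=1}^{T} c_{i,t} - T c^{ub}_i,
  \quad i=1,2,\ldots,M,
\]
where $\rho^\star$ denotes the optimal constrained average reward, $r_t$ the reward earned at time $t$, and $c_{i,t}$ the $i$-th cost incurred at time $t$. The regret vector is then $\bigl(R^{(0)}(T), R^{(1)}(T), \ldots, R^{(M)}(T)\bigr)$, whose expectation is bounded componentwise as $\tilde{O}\left(T^{2/3}\right)$.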