We consider the problem of tabular infinite-horizon concave utility reinforcement learning (CURL) with convex constraints. Various learning applications with constraints, such as robotics, do not allow for policies that can violate constraints. To this end, we propose a model-based learning algorithm that achieves zero constraint violations. To obtain this result, we assume that the concave objective and the convex constraints have a solution interior to the set of feasible occupation measures. We then solve a tighter optimization problem to ensure that the constraints are never violated despite the imprecise model knowledge and model stochasticity. We also propose a novel Bellman error based analysis for tabular infinite-horizon setups which allows us to analyse stochastic policies. Combining the Bellman error based analysis with the tighter optimization problem, for $T$ interactions with the environment, we obtain a regret guarantee for the objective which grows as $\Tilde{O}(1/\sqrt{T})$, excluding other factors.
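To make the tightening step concrete, the following is a minimal schematic of the nominal and tightened problems, assuming an occupation measure $\lambda$ over a feasible set $\Lambda$ (with $\widehat{\Lambda}$ its estimate under the learned model), a concave objective $f$, convex constraints $g$, and a tightening parameter $\epsilon > 0$; the notation is illustrative rather than the paper's exact formulation:
\[
\text{(nominal)}\;\; \max_{\lambda \in \Lambda} f(\lambda) \;\;\text{s.t.}\;\; g(\lambda) \le 0,
\qquad
\text{(tightened)}\;\; \max_{\lambda \in \widehat{\Lambda}} f(\lambda) \;\;\text{s.t.}\;\; g(\lambda) \le -\epsilon.
\]
Under the interior-point (Slater-type) assumption the tightened problem remains feasible, and choosing $\epsilon$ large enough to dominate the model-estimation error means any policy induced by its solution also satisfies the original constraint $g(\lambda) \le 0$ despite model stochasticity.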