While the primary goal of the exploration phase in reward-free reinforcement learning (RF-RL) is to reduce the uncertainty in the estimated model with a minimum number of trajectories, in practice the agent often needs to abide by certain safety constraints at the same time. It remains unclear how such a safe exploration requirement affects the sample complexity needed to achieve the desired optimality of the obtained policy in planning. In this work, we make a first attempt to answer this question. In particular, we consider the scenario where a safe baseline policy is known beforehand, and propose a unified Safe reWard-frEe ExploraTion (SWEET) framework. We then particularize the SWEET framework to the tabular and low-rank MDP settings, and develop algorithms coined Tabular-SWEET and Low-rank-SWEET, respectively. Both algorithms leverage the concavity and continuity of the newly introduced truncated value functions, and are guaranteed to achieve zero constraint violation during exploration with high probability. Furthermore, both algorithms can provably find a near-optimal policy subject to any constraint in the planning phase. Remarkably, the sample complexities of both algorithms match or even outperform the state of the art in their constraint-free counterparts up to constant factors, proving that the safety constraint hardly increases the sample complexity of RF-RL.
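To illustrate the structural property the abstract alludes to, the following is a minimal sketch of a truncated value function; it is not necessarily the paper's exact construction, and the reward r, occupancy measure q^{\pi}, and truncation level \tau are notation introduced here purely for illustration.

\[
V^{\pi}(r) \;=\; \langle q^{\pi},\, r \rangle
\quad\text{(linear in the occupancy measure } q^{\pi}\text{)},
\qquad
\widetilde{V}^{\pi}(r;\tau) \;=\; \min\bigl\{ \langle q^{\pi},\, r \rangle,\; \tau \bigr\}.
\]

Since the pointwise minimum of a linear function and a constant is concave and continuous, \(\widetilde{V}^{\pi}(r;\tau)\) is concave and continuous in \(q^{\pi}\), which is the kind of concavity and continuity the SWEET algorithms are stated to exploit.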