Reward-free reinforcement learning (RF-RL), a recently introduced RL paradigm, relies on random action-taking to explore the unknown environment without any reward feedback. While the primary goal of the exploration phase in RF-RL is to reduce the uncertainty in the estimated model with a minimum number of trajectories, in practice the agent often needs to abide by certain safety constraints at the same time. It remains unclear how such a safe exploration requirement would affect the corresponding sample complexity needed to achieve the desired optimality of the obtained policy in planning. In this work, we make a first attempt to answer this question. In particular, we consider the scenario where a safe baseline policy is known beforehand, and propose a unified Safe reWard-frEe ExploraTion (SWEET) framework. We then particularize the SWEET framework to the tabular and the low-rank MDP settings, and develop algorithms coined Tabular-SWEET and Low-rank-SWEET, respectively. Both algorithms leverage the concavity and continuity of the newly introduced truncated value functions, and are guaranteed to achieve zero constraint violation during exploration with high probability. Furthermore, both algorithms can provably find a near-optimal policy subject to any constraint in the planning phase. Remarkably, the sample complexities of both algorithms match or even outperform the state of the art in their constraint-free counterparts up to some constant factors, proving that the safety constraint hardly increases the sample complexity of RF-RL.