Reward-free reinforcement learning (RF-RL), a recently introduced RL paradigm, relies on random action-taking to explore the unknown environment without any reward feedback. While the primary goal of the exploration phase in RF-RL is to reduce the uncertainty in the estimated model with a minimum number of trajectories, in practice the agent often needs to abide by certain safety constraints at the same time. It remains unclear how such a safe exploration requirement would affect the corresponding sample complexity required to achieve the desired optimality of the obtained policy in planning. In this work, we make a first attempt to answer this question. In particular, we consider the scenario where a safe baseline policy is known beforehand, and propose a unified Safe reWard-frEe ExploraTion (SWEET) framework. We then particularize the SWEET framework to the tabular and low-rank MDP settings, and develop algorithms coined Tabular-SWEET and Low-rank-SWEET, respectively. Both algorithms leverage the concavity and continuity of the newly introduced truncated value functions, and are guaranteed to achieve zero constraint violation during exploration with high probability. Furthermore, both algorithms can provably find a near-optimal policy subject to any given constraint in the planning phase. Remarkably, the sample complexities of both algorithms match or even outperform the state of the art in their constraint-free counterparts up to constant factors, proving that the safety constraint hardly increases the sample complexity of RF-RL.