In constrained reinforcement learning (RL), a learning agent seeks not only to optimize the overall reward but also to satisfy additional safety, diversity, or budget constraints. Consequently, existing constrained RL solutions require several new algorithmic ingredients that are notably different from standard RL. On the other hand, reward-free RL has been developed independently in the unconstrained literature: it learns the transition dynamics without using the reward information, and is thus naturally capable of addressing RL with multiple objectives under common dynamics. This paper bridges reward-free RL and constrained RL. In particular, we propose a simple meta-algorithm such that, given any reward-free RL oracle, the approachability and constrained RL problems can be solved directly with negligible overhead in sample complexity. Utilizing existing reward-free RL solvers, our framework provides sharp sample complexity results for constrained RL in the tabular MDP setting, matching the best existing results up to a factor in the horizon dependence; it also extends directly to the setting of tabular two-player Markov games, and gives a new result for constrained RL with linear function approximation.
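As a rough illustration of the two-phase reduction the abstract describes (and not the paper's actual meta-algorithm), the sketch below first runs a reward-free exploration phase that estimates the transition dynamics of a small tabular MDP without looking at any reward, and then reuses the learned model to handle a reward plus a single budget constraint. The environment sizes, the uniform exploration policy standing in for a reward-free RL oracle, and the Lagrangian primal-dual planning loop are all hypothetical choices made for this demo.

```python
"""Illustrative sketch only: constrained RL on top of reward-free exploration.
Phase 1 learns transition dynamics with no reward signal; Phase 2 plans on the
learned model for a reward r(s, a) subject to E[sum of c(s, a)] <= tau.
"""
import numpy as np

rng = np.random.default_rng(0)

# --- Hypothetical finite-horizon tabular MDP ---------------------------------
S, A, H = 5, 3, 10                                 # states, actions, horizon (made up)
P_true = rng.dirichlet(np.ones(S), size=(S, A))    # true dynamics: P_true[s, a] is a dist over s'
reward = rng.uniform(size=(S, A))                  # task reward r(s, a)
cost = rng.uniform(size=(S, A))                    # constraint cost c(s, a)
tau = 0.45 * H                                     # constraint budget


# --- Phase 1: reward-free exploration (stand-in for any reward-free oracle) --
def explore_reward_free(n_episodes: int) -> np.ndarray:
    """Collect transitions with a reward-independent (uniform) policy and
    return an estimated transition model P_hat."""
    counts = np.ones((S, A, S))          # +1 smoothing so every row is a distribution
    for _ in range(n_episodes):
        s = rng.integers(S)
        for _ in range(H):
            a = rng.integers(A)          # reward-free: action choice ignores r and c
            s_next = rng.choice(S, p=P_true[s, a])
            counts[s, a, s_next] += 1
            s = s_next
    return counts / counts.sum(axis=2, keepdims=True)


def plan(P_hat: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Finite-horizon value iteration on the learned model for scalar reward r;
    returns the greedy policy for each step h."""
    V = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = r + P_hat @ V                # Q[s, a] = r(s, a) + E_{s' ~ P_hat}[V(s')]
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return policy


def evaluate(P_hat: np.ndarray, policy: np.ndarray, r: np.ndarray) -> float:
    """Expected cumulative r under `policy` on the learned model (uniform start)."""
    V = np.zeros(S)
    for h in reversed(range(H)):
        V = r[np.arange(S), policy[h]] + P_hat[np.arange(S), policy[h]] @ V
    return float(V.mean())


# --- Phase 2: constrained planning on the learned model (Lagrangian loop) ----
P_hat = explore_reward_free(n_episodes=2000)
lam, step = 0.0, 0.05                    # dual variable and step size (hypothetical)
for _ in range(200):
    policy = plan(P_hat, reward - lam * cost)              # best response to current lambda
    violation = evaluate(P_hat, policy, cost) - tau
    lam = max(0.0, lam + step * violation)                  # projected dual ascent

print(f"reward = {evaluate(P_hat, policy, reward):.3f}")
print(f"cost   = {evaluate(P_hat, policy, cost):.3f} (budget tau = {tau:.3f})")
```

The key point the sketch tries to convey is that the exploration phase never touches `reward` or `cost`, so the same learned model can be reused for any number of objectives specified afterwards; only the cheap planning phase depends on the constraints.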