Real-world reinforcement learning (RL) is often severely limited because typical RL algorithms rely heavily on a reset mechanism to sample proper initial states. In practice, resets are expensive to implement, requiring human intervention or heavily engineered environments. To make learning more practical, we propose a generic no-regret reduction that systematically designs reset-free RL algorithms. Our reduction turns reset-free RL into a two-player game. We show that achieving sublinear regret in this two-player game implies learning a policy with both sublinear performance regret and a sublinear total number of resets in the original RL problem. In other words, the agent eventually learns to perform optimally while avoiding resets. Using this reduction, we design an instantiation for linear Markov decision processes, which is, to our knowledge, the first provably correct reset-free RL algorithm.
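As a rough, illustrative sketch of what this guarantee means formally (the notation below is an assumption for exposition, not the paper's exact definitions):

```latex
% Illustrative notation only; the paper defines the precise quantities.
% Over K episodes, let \pi_k be the policy played in episode k and \pi^*
% an optimal policy that never needs to reset. Write
\[
  \mathrm{Regret}(K) \;=\; \sum_{k=1}^{K}\bigl( V^{\pi^*}(s_1) - V^{\pi_k}(s_1) \bigr),
  \qquad
  \mathrm{Resets}(K) \;=\; \sum_{k=1}^{K}\mathbb{E}\bigl[\#\text{ resets incurred by } \pi_k \bigr].
\]
% The reduction says: if both players in the constructed two-player game
% achieve sublinear regret, then
\[
  \mathrm{Regret}(K) = o(K)
  \quad\text{and}\quad
  \mathrm{Resets}(K) = o(K),
\]
% so the per-episode suboptimality and the per-episode number of resets
% both vanish as K grows.
```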