Real-world reinforcement learning (RL) is often severely limited because typical RL algorithms heavily rely on a reset mechanism to sample proper initial states. In practice, the reset mechanism is expensive to implement, as it requires human intervention or heavily engineered environments. To make learning more practical, we propose a generic no-regret reduction for systematically designing reset-free RL algorithms. Our reduction turns reset-free RL into a two-player game. We show that achieving sublinear regret in this two-player game implies learning a policy with both sublinear performance regret and a sublinear total number of resets in the original RL problem. This means that the agent eventually learns to perform optimally and to avoid resets. Using this reduction, we design an instantiation for linear Markov decision processes, which, to our knowledge, is the first provably correct reset-free RL algorithm.
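To make the shape of this guarantee concrete, the following is an illustrative sketch only, not the paper's exact statement; the symbols $\mathrm{Reg}^{(1)}_K$ and $\mathrm{Reg}^{(2)}_K$ (the two players' regrets in the game over $K$ episodes) are our notation for illustration. The reduction is constructed so that, roughly,
\[
  \underbrace{\mathrm{Regret}_K}_{\text{performance regret}}
  \;+\;
  \underbrace{\mathrm{Resets}_K}_{\text{total number of resets}}
  \;\lesssim\;
  \mathrm{Reg}^{(1)}_K + \mathrm{Reg}^{(2)}_K ,
\]
so if both players run no-regret algorithms, i.e., $\mathrm{Reg}^{(1)}_K + \mathrm{Reg}^{(2)}_K = o(K)$, then the performance regret and the cumulative resets are both sublinear in $K$.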