A ubiquitous requirement in many practical reinforcement learning (RL) applications, including medical treatment, recommendation systems, education, and robotics, is that the deployed policy that actually interacts with the environment cannot change frequently. Such an RL setting is called low-switching-cost RL: the goal is to achieve the highest reward while reducing the number of policy switches during training. Despite the recent trend of theoretical studies aiming to design provably efficient RL algorithms with low switching costs, none of the existing approaches have been thoroughly evaluated in popular RL testbeds. In this paper, we systematically study a wide collection of policy-switching approaches, including theoretically guided criteria, policy-difference-based methods, and non-adaptive baselines. Through extensive experiments on a medical treatment environment, Atari games, and robotic control tasks, we present the first empirical benchmark for low-switching-cost RL and report novel findings on how to decrease the switching cost while maintaining sample efficiency comparable to the case without the low-switching-cost constraint. We hope this benchmark can serve as a starting point for developing more practically effective low-switching-cost RL algorithms. We release our code and complete results at https://sites.google.com/view/low-switching-cost-rl.