Continuous-time systems are often modeled using discrete-time dynamics, but this requires a small simulation step to maintain accuracy. In turn, this requires a large planning horizon, which leads to computationally demanding planning problems and reduced performance. Previous work in model-free reinforcement learning has partially addressed this issue using action repeats, where a policy is learned to determine a discrete action duration. Instead, we propose to control the continuous decision timescale directly by using temporally-extended actions and letting the planner treat the duration of the action as an additional optimization variable along with the standard action variables. This additional structure has multiple advantages. It speeds up the simulation of trajectories and, importantly, it allows for deep-horizon search in terms of primitive actions while using a shallow search depth in the planner. In addition, in the model-based reinforcement learning (MBRL) setting, it reduces the compounding errors of model learning and shortens model training time. We show that this idea is effective and that the range of action durations can be automatically selected using a multi-armed bandit formulation and integrated into the MBRL framework. An extensive experimental evaluation, both in planning and in MBRL, shows that our approach yields faster planning and better solutions, and that it enables solutions to problems that are not solved in the standard formulation.
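As a concrete illustration of treating the action duration as an additional optimization variable, the following minimal sketch (an assumption for illustration, not the paper's implementation) plans over (action, duration) pairs with random shooting on a toy point-mass task. The dynamics, cost function, and constants such as `DT`, `HORIZON`, and `DUR_RANGE` are hypothetical; the key point is that each planner decision is held fixed for many small primitive steps, so a shallow planner depth covers a deep primitive-action horizon.

```python
# Minimal sketch: planning with temporally-extended actions, where the
# planner optimizes (action, duration) pairs via random shooting.
import numpy as np

DT = 0.01                # small primitive simulation step (keeps accuracy)
HORIZON = 3              # shallow planner depth in (action, duration) pairs
N_CANDIDATES = 256       # random-shooting candidates
DUR_RANGE = (0.05, 0.5)  # continuous range of allowed action durations

def simulate_step(state, action):
    """One primitive step of a toy 1-D point mass: state = (pos, vel)."""
    pos, vel = state
    vel = vel + DT * action
    pos = pos + DT * vel
    return np.array([pos, vel])

def rollout(state, plan):
    """Hold each action fixed for ceil(duration / DT) primitive steps."""
    cost = 0.0
    for action, duration in plan:
        for _ in range(int(np.ceil(duration / DT))):
            state = simulate_step(state, action)
            cost += DT * (state[0] ** 2 + 0.01 * action ** 2)  # quadratic cost
    return cost

def plan_random_shooting(state, rng):
    """Sample candidate plans over the joint (action, duration) space;
    the duration is an optimization variable just like the action."""
    best_plan, best_cost = None, np.inf
    for _ in range(N_CANDIDATES):
        plan = [(rng.uniform(-1.0, 1.0), rng.uniform(*DUR_RANGE))
                for _ in range(HORIZON)]
        cost = rollout(state, plan)
        if cost < best_cost:
            best_plan, best_cost = plan, cost
    return best_plan, best_cost

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    plan, cost = plan_random_shooting(np.array([1.0, 0.0]), rng)
    print("best (action, duration) plan:", plan, "cost:", cost)
```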
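The automatic selection of the action-duration range could, under one possible reading, be handled with a standard bandit rule such as UCB1, as in the sketch below. The arm set `ARMS`, the reward definition, and the helper `evaluate_range` are hypothetical placeholders for the evaluation loop of the MBRL framework, not the paper's exact formulation.

```python
# Minimal UCB1 sketch for choosing among candidate maximum durations;
# each arm's reward is the return obtained when planning with that range.
import numpy as np

ARMS = [0.1, 0.25, 0.5, 1.0]  # candidate maximum durations (assumed)

def ucb1_select(counts, values, t, c=2.0):
    """Pick an arm by the UCB1 rule; untried arms are chosen first."""
    for a in range(len(counts)):
        if counts[a] == 0:
            return a
    bonus = np.sqrt(c * np.log(t) / counts)
    return int(np.argmax(values + bonus))

def run_bandit(evaluate_range, n_rounds=50):
    """`evaluate_range(max_dur)` runs one planning episode with durations
    up to max_dur and returns its reward (higher is better)."""
    counts = np.zeros(len(ARMS))
    values = np.zeros(len(ARMS))
    for t in range(1, n_rounds + 1):
        a = ucb1_select(counts, values, t)
        reward = evaluate_range(ARMS[a])
        counts[a] += 1
        values[a] += (reward - values[a]) / counts[a]  # incremental mean
    return ARMS[int(np.argmax(values))]

if __name__ == "__main__":
    # Stand-in evaluation: pretend durations near 0.5 work best.
    rng = np.random.default_rng(0)
    noisy_eval = lambda d: -abs(d - 0.5) + rng.normal(0.0, 0.05)
    print("selected max duration:", run_bandit(noisy_eval))
```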