Model-based reinforcement learning has attracted wide attention due to its superior sample efficiency. Despite its impressive success so far, it remains unclear how to appropriately schedule important hyperparameters, such as the real data ratio for policy optimization in Dyna-style model-based algorithms, to achieve adequate performance. In this paper, we first theoretically analyze the role of real data in policy training, which suggests that gradually increasing the ratio of real data yields better performance. Inspired by this analysis, we propose a framework named AutoMBPO to automatically schedule the real data ratio, along with other hyperparameters, when training the model-based policy optimization (MBPO) algorithm, a representative running case of model-based methods. On several continuous control tasks, the MBPO instance trained with hyperparameters scheduled by AutoMBPO significantly surpasses the original one, and the real data ratio schedule found by AutoMBPO is consistent with our theoretical analysis.
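To make the scheduling idea concrete, the following is a minimal sketch of mixing real and model-generated transitions with a gradually increasing real data ratio. The linear ramp, the buffer representation, and the helper names `real_data_ratio` and `sample_batch` are illustrative assumptions, not AutoMBPO's actual learned schedule or MBPO's implementation.

```python
import random


def real_data_ratio(step, total_steps, start=0.1, end=0.9):
    """Linearly ramp the fraction of real transitions used for policy updates.

    An assumed schedule form; AutoMBPO learns the schedule rather than fixing it.
    """
    frac = min(step / total_steps, 1.0)
    return start + (end - start) * frac


def sample_batch(real_buffer, model_buffer, batch_size, ratio):
    """Draw a mixed batch: a `ratio` fraction from real data, the rest from model rollouts."""
    n_real = int(batch_size * ratio)
    n_model = batch_size - n_real
    batch = random.sample(real_buffer, min(n_real, len(real_buffer)))
    batch += random.sample(model_buffer, min(n_model, len(model_buffer)))
    return batch


# Toy buffers standing in for environment transitions and model-generated rollouts.
real_buffer = [("real", i) for i in range(1_000)]
model_buffer = [("model", i) for i in range(10_000)]

total_steps = 100
for step in range(total_steps):
    ratio = real_data_ratio(step, total_steps)
    batch = sample_batch(real_buffer, model_buffer, batch_size=32, ratio=ratio)
    # policy_update(batch)  # placeholder for the actual policy optimization step
```

Early in training the batch is dominated by cheap model-generated data; as the model's usefulness saturates, the sketch shifts weight toward real transitions, matching the trend the theoretical analysis suggests.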