Non-differentiable controllers and rule-based policies are widely used to control real systems such as robots and telecommunication networks. In this paper, we present a practical reinforcement learning method that improves upon such existing policies, using a model-based approach for better sample efficiency. Our method significantly outperforms state-of-the-art model-based methods in terms of sample efficiency on several widely used robotic benchmark tasks. We also demonstrate the effectiveness of our approach on a control problem in the telecommunications domain, where model-based methods have not previously been explored. Experimental results indicate that strong initial performance can be achieved in combination with improved sample efficiency. We further motivate the design of our algorithm with a theoretical lower bound on the performance.