In this paper, we present a Model-Based Reinforcement Learning (MBRL) algorithm named \emph{Monte Carlo Probabilistic Inference for Learning COntrol} (MC-PILCO). The algorithm relies on Gaussian Processes (GPs) to model the system dynamics and on a Monte Carlo approach to estimate the policy gradient. This defines a framework in which we ablate the choice of the following components: (i) the selection of the cost function, (ii) the optimization of policies using dropout, and (iii) improved data efficiency through the use of structured kernels in the GP models. The combination of these aspects dramatically affects the performance of MC-PILCO. Numerical comparisons in a simulated cart-pole environment show that MC-PILCO exhibits better data efficiency and control performance than state-of-the-art GP-based MBRL algorithms. Finally, we apply MC-PILCO to real systems, considering in particular systems with partially measurable states. We discuss the importance of modeling both the measurement system and the state estimators during policy optimization. The effectiveness of the proposed solutions has been tested in simulation and on two real systems, a Furuta pendulum and a ball-and-plate rig.
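As a brief sketch of the Monte Carlo gradient estimate referred to above (the notation here is illustrative, not taken verbatim from the paper), the expected cumulative cost of a policy $\pi_\theta$ can be approximated by simulating $M$ particles through the learned GP dynamics model,
\[
\hat{J}(\theta) = \frac{1}{M}\sum_{m=1}^{M}\sum_{t=0}^{T} c\big(x_t^{(m)}\big),
\qquad
x_{t+1}^{(m)} \sim \hat{p}\big(x_{t+1} \,\big|\, x_t^{(m)}, \pi_\theta(x_t^{(m)})\big),
\]
where $c(\cdot)$ is the cost function and $\hat{p}$ denotes the GP posterior predictive distribution; the policy gradient $\nabla_\theta \hat{J}(\theta)$ is then obtained by backpropagating through the reparameterized particle rollouts.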