In this paper, we present a Model-Based Reinforcement Learning algorithm named Monte Carlo Probabilistic Inference for Learning COntrol (MC-PILCO). The algorithm relies on Gaussian Processes (GPs) to model the system dynamics and on a Monte Carlo approach to estimate the policy gradient. This defines a framework in which we ablate the choice of the following components: (i) the selection of the cost function, (ii) the optimization of policies with dropout, and (iii) the improvement of data efficiency through the use of structured kernels in the GP models. The combination of the aforementioned aspects dramatically affects the performance of MC-PILCO. Numerical comparisons in a simulated cart-pole environment show that MC-PILCO exhibits better data efficiency and control performance than state-of-the-art GP-based MBRL algorithms. Finally, we apply MC-PILCO to real systems, considering in particular systems with partially measurable states. We discuss the importance of modeling both the measurement system and the state estimators during policy optimization. The effectiveness of the proposed solutions has been tested in simulation and on two real systems: a Furuta pendulum and a ball-and-plate setup.