Model-based reinforcement learning (MBRL) with real-time planning has shown great potential in locomotion and manipulation control tasks. However, existing planning methods, such as the Cross-Entropy Method (CEM), do not scale well to complex high-dimensional environments. A key reason for this underperformance is a lack of exploration, as these planning methods only aim to maximize the cumulative extrinsic reward over the planning horizon. Furthermore, planning inside the compact latent space in the absence of observations makes it challenging to use curiosity-based intrinsic motivation. We propose Curiosity CEM (CCEM), an improved version of the CEM algorithm that encourages exploration via curiosity. Our proposed method maximizes the sum of state-action Q-values over the planning horizon, where these Q-values estimate both future extrinsic and intrinsic reward, thereby encouraging the agent to reach novel observations. In addition, our model uses contrastive representation learning to learn latent representations efficiently. Experiments on image-based continuous control tasks from the DeepMind Control suite show that CCEM is more sample-efficient than previous MBRL algorithms by a large margin and compares favorably with the best model-free RL methods.
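The following is a minimal sketch of the kind of planning loop the abstract describes: a CEM planner that scores sampled action sequences by the sum of Q-values along a latent rollout, where the Q-function is assumed to already fold in both extrinsic and curiosity-based intrinsic reward. All components here (the latent dynamics, the Q-function, and all dimensions and hyperparameters) are hypothetical stand-ins chosen to make the sketch runnable; the paper's actual encoder, dynamics model, and Q-networks are learned separately (e.g., with contrastive representation learning and TD updates).

```python
# Hedged sketch of a Curiosity-CEM-style planning loop (not the authors' code).
# The "learned" models below are random placeholder functions.
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, ACTION_DIM, HORIZON = 16, 4, 12
NUM_SAMPLES, NUM_ELITES, CEM_ITERS = 256, 32, 6

# Hypothetical learned components (placeholder parameters).
W_dyn = rng.normal(scale=0.1, size=(LATENT_DIM, LATENT_DIM + ACTION_DIM))
w_q = rng.normal(scale=0.1, size=LATENT_DIM + ACTION_DIM)

def dynamics(z, a):
    """Latent transition model: predicts the next latent state."""
    return np.tanh(W_dyn @ np.concatenate([z, a]))

def q_value(z, a):
    """State-action value estimating future extrinsic + intrinsic return."""
    return float(w_q @ np.concatenate([z, a]))

def plan(z0):
    """CEM over action sequences, scored by summed Q-values along the rollout."""
    mean = np.zeros((HORIZON, ACTION_DIM))
    std = np.ones((HORIZON, ACTION_DIM))
    for _ in range(CEM_ITERS):
        # Sample candidate action sequences from the current Gaussian.
        actions = np.clip(
            mean + std * rng.normal(size=(NUM_SAMPLES, HORIZON, ACTION_DIM)),
            -1.0, 1.0)
        returns = np.zeros(NUM_SAMPLES)
        for i in range(NUM_SAMPLES):
            z = z0
            for t in range(HORIZON):
                returns[i] += q_value(z, actions[i, t])  # extrinsic + curiosity
                z = dynamics(z, actions[i, t])
        # Refit the sampling distribution to the elite sequences.
        elite = actions[np.argsort(returns)[-NUM_ELITES:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean[0]  # execute only the first action (MPC style)

first_action = plan(rng.normal(size=LATENT_DIM))
print(first_action)
```

The key difference from vanilla CEM planning is the scoring function: instead of summing predicted extrinsic rewards alone, candidates are ranked by Q-values that also capture the intrinsic (curiosity) return, which biases the refitted action distribution toward trajectories expected to reach novel observations.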