Providing densely shaped reward functions for RL algorithms is often exceedingly challenging, motivating the development of RL algorithms that can learn from easier-to-specify sparse reward functions. This sparsity poses new exploration challenges. One common way to address this problem is using demonstrations to provide initial signal about regions of the state space with high rewards. However, prior RL-from-demonstrations algorithms introduce significant complexity and many hyperparameters, making them hard to implement and tune. We introduce Monte Carlo Augmented Actor Critic (MCAC), a parameter-free modification to standard actor-critic algorithms which initializes the replay buffer with demonstrations and computes a modified $Q$-value by taking the maximum of the standard temporal difference (TD) target and a Monte Carlo estimate of the reward-to-go. This encourages exploration in the neighborhood of high-performing trajectories by encouraging high $Q$-values in corresponding regions of the state space. Experiments across $5$ continuous control domains suggest that MCAC can be used to significantly increase learning efficiency across $6$ commonly used RL and RL-from-demonstrations algorithms. See https://sites.google.com/view/mcac-rl for code and supplementary material.
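The core target modification described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: function names, array shapes, and the TD(0) form of the bootstrapped target are assumptions for the sake of a self-contained example.

```python
# Sketch of the MCAC target: elementwise max of the standard TD target
# and the Monte Carlo reward-to-go. Names/shapes are illustrative.
import numpy as np

def monte_carlo_returns(rewards, gamma):
    """Discounted reward-to-go G_t = sum_{k>=t} gamma^{k-t} r_k for one trajectory."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def mcac_targets(rewards, next_q, dones, gamma):
    """Maximum of the TD(0) target and the Monte Carlo estimate at each step.

    rewards, next_q, dones: per-step arrays for a single stored trajectory,
    where next_q holds the critic's value estimates at the successor states.
    """
    td = rewards + gamma * (1.0 - dones) * next_q
    mc = monte_carlo_returns(rewards, gamma)
    return np.maximum(td, mc)
```

Taking the maximum means that along a demonstrated (or otherwise high-performing) trajectory, the target can never fall below the observed return, which keeps $Q$-values high in those regions even while the critic is still poorly fit.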