State-of-the-art reinforcement learning (RL) algorithms suffer from high sample complexity, particularly in the sparse-reward case. A popular strategy for mitigating this problem is to learn control policies by imitating a set of expert demonstrations. The drawback of such approaches is that an expert needs to produce demonstrations, which may be costly in practice. To address this shortcoming, we propose Probabilistic Planning for Demonstration Discovery (P2D2), a technique for automatically discovering demonstrations without access to an expert. We formulate demonstration discovery as a search problem and leverage widely used planning algorithms such as Rapidly-exploring Random Trees (RRT) to find demonstration trajectories. These demonstrations are used to initialize a policy, which is then refined by a generic RL algorithm. We provide theoretical guarantees that P2D2 finds successful trajectories, as well as bounds on its sample complexity. We experimentally demonstrate that the method outperforms classic and intrinsic-exploration RL techniques on a range of classic control and robotics tasks, requiring only a fraction of the exploration samples and achieving better asymptotic performance.
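To make the pipeline concrete, the following is a minimal sketch of the idea the abstract describes: an RRT-style search discovers a trajectory that reaches a sparse goal, and the resulting (state, action) pairs are used to initialize a policy by behavior cloning, which an RL algorithm could then refine. The toy 2D point-mass environment, the linear cloned policy, and all constants (START, GOAL, STEP) are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (assumptions noted above), not the authors' implementation:
# (1) grow an RRT-style tree in a toy 2D point-mass environment until it
#     reaches a sparse goal region, yielding a demonstration trajectory;
# (2) behavior-clone a simple linear policy a = W s + b from that trajectory,
#     as a starting point that a generic RL algorithm could later refine.

import numpy as np

rng = np.random.default_rng(0)

START = np.array([0.0, 0.0])      # illustrative start state
GOAL = np.array([9.0, 9.0])       # illustrative sparse goal
GOAL_RADIUS = 0.5
BOUNDS = (0.0, 10.0)
STEP = 0.5                        # maximum displacement per tree extension


def rrt_demonstration(max_iters=5000):
    """Grow a rapidly-exploring random tree from START until it reaches GOAL."""
    nodes = [START]
    parents = [-1]
    for _ in range(max_iters):
        sample = rng.uniform(*BOUNDS, size=2)              # random target state
        nearest = int(np.argmin([np.linalg.norm(n - sample) for n in nodes]))
        direction = sample - nodes[nearest]
        new = nodes[nearest] + STEP * direction / (np.linalg.norm(direction) + 1e-8)
        nodes.append(new)
        parents.append(nearest)
        if np.linalg.norm(new - GOAL) < GOAL_RADIUS:       # goal reached: backtrack
            path, i = [], len(nodes) - 1
            while i != -1:
                path.append(nodes[i])
                i = parents[i]
            return path[::-1]
    return None                                            # search budget exhausted


def behavior_clone(path):
    """Fit a linear policy on (state, action) pairs extracted from the path."""
    states = np.array(path[:-1])
    actions = np.array([path[i + 1] - path[i] for i in range(len(path) - 1)])
    X = np.hstack([states, np.ones((len(states), 1))])     # append bias column
    W, *_ = np.linalg.lstsq(X, actions, rcond=None)        # least-squares fit
    return W


if __name__ == "__main__":
    demo = rrt_demonstration()
    if demo is not None:
        W = behavior_clone(demo)
        print(f"demonstration length: {len(demo)} states")
        print("cloned policy action at start:", np.array([*START, 1.0]) @ W)
```

In this sketch the planner supplies the exploration signal that sparse rewards fail to provide; in practice the cloned policy would be handed to an off-the-shelf RL algorithm for further refinement rather than used as-is.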