KL-regularized reinforcement learning from expert demonstrations has proved successful in improving the sample efficiency of deep reinforcement learning algorithms, allowing them to be applied to challenging physical real-world tasks. However, we show that KL-regularized reinforcement learning with behavioral reference policies derived from expert demonstrations can suffer from pathological training dynamics that can lead to slow, unstable, and suboptimal online learning. We show empirically that the pathology occurs for commonly chosen behavioral policy classes and demonstrate its impact on sample efficiency and online policy performance. Finally, we show that the pathology can be remedied by non-parametric behavioral reference policies and that this allows KL-regularized reinforcement learning to significantly outperform state-of-the-art approaches on a variety of challenging locomotion and dexterous hand manipulation tasks.
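For concreteness, a standard form of the KL-regularized objective referred to here augments the reward with a penalty on the divergence from a behavioral reference policy; this is a generic sketch, where $\pi_0$ denotes the reference policy derived from expert demonstrations and $\alpha$ a regularization coefficient (the paper's exact formulation may differ in details):

$$
J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\Big(r(s_t, a_t) \;-\; \alpha\, D_{\mathrm{KL}}\!\big(\pi(\cdot \mid s_t)\,\|\,\pi_0(\cdot \mid s_t)\big)\Big)\right]
$$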