The options framework for hierarchical reinforcement learning has gained popularity in recent years and has contributed to addressing the scalability problem in reinforcement learning. Yet, most of these recent successes hinge on proper option initialization or discovery. When an expert is available, the option discovery problem can be addressed by learning an options-type hierarchical policy directly from expert demonstrations. This problem, referred to as hierarchical imitation learning, can be cast as an inference problem in a Hidden Markov Model and is typically solved with an Expectation-Maximization-type algorithm. In this work, we propose a novel online algorithm for hierarchical imitation learning in the options framework. Further, we discuss the benefits of such an algorithm and compare it with its batch counterpart on classical reinforcement learning benchmarks. We show that this approach works well in both discrete and continuous environments and, under certain conditions, outperforms the batch version.