Hierarchical decomposition of control is unavoidable in large dynamical systems. In reinforcement learning (RL), it is usually realized with subgoals that are defined at a higher policy level and achieved at a lower policy level. Reaching these goals can take a substantial amount of time, during which it is not verified whether they are still worth pursuing. Yet, due to the randomness of the environment, these goals may become obsolete. In this paper, we address this gap in state-of-the-art approaches and propose a method in which the validity of the higher-level actions (and thus of the lower-level goals) is constantly verified at the higher level. If these actions, i.e., the lower-level goals, become inadequate, they are replaced by more appropriate ones. In this way we combine the advantages of hierarchical RL, namely fast training, with those of flat RL, namely immediate reactivity. We evaluate our approach experimentally on seven benchmark environments.
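As a minimal sketch of this idea (assuming goal-conditioned policies `high_policy` and `low_policy` and a Gym-style `env.reset`/`env.step` interface; all names here are hypothetical and not taken from the paper), the control loop below re-queries the higher level at every step and replaces the current subgoal whenever the higher level would now choose a different one:

```python
import numpy as np

def run_episode(env, high_policy, low_policy, max_steps=1000):
    """Hierarchical control loop in which the higher level re-checks,
    at every step, whether the current subgoal is still the one it
    would choose; if not, the subgoal is replaced immediately."""
    state = env.reset()
    subgoal = high_policy(state)              # initial higher-level action
    total_reward = 0.0
    for _ in range(max_steps):
        # Constant verification: query the higher level again and
        # swap the subgoal if it has become inadequate.
        preferred = high_policy(state)
        if not np.allclose(preferred, subgoal):
            subgoal = preferred

        action = low_policy(state, subgoal)   # lower level pursues the subgoal
        state, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```

This contrasts with standard hierarchical RL, where the subgoal would be held fixed for a number of steps (or until reached) regardless of how the environment has changed in the meantime.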