Hierarchical reinforcement learning (HRL) has seen widespread interest as an approach to tractable learning of complex modular behaviors. However, existing work either assumes access to expert-constructed hierarchies or relies on hierarchy-learning heuristics with no provable guarantees. To address this gap, we analyze HRL in the meta-RL setting, where a learner learns latent hierarchical structure during meta-training for use in a downstream task. We consider a tabular setting where natural hierarchical structure is embedded in the transition dynamics. Analogous to supervised meta-learning theory, we provide "diversity conditions" which, together with a tractable optimism-based algorithm, guarantee sample-efficient recovery of this natural hierarchy. Furthermore, we provide regret bounds for a learner that uses the recovered hierarchy to solve a meta-test task. Our bounds incorporate notions common in the HRL literature, such as temporal and state/action abstractions, suggesting that our setting and analysis capture important features of HRL in practice.