Reinforcement learning (RL) has achieved impressive performance in a variety of online settings in which an agent's ability to query the environment for transitions and rewards is effectively unlimited. However, in many practical applications, the situation is reversed: an agent may have access to large amounts of undirected offline experience data, while access to the online environment is severely limited. In this work, we focus on this offline setting. Our main insight is that, when presented with offline data composed of a variety of behaviors, an effective way to leverage this data is to extract a continuous space of recurring and temporally extended primitive behaviors before using these primitives for downstream task learning. Primitives extracted in this way serve two purposes: they delineate the behaviors that are supported by the data from those that are not, making them useful for avoiding distributional shift in offline RL; and they provide a degree of temporal abstraction, which reduces the effective horizon, yielding better learning in theory and improved offline RL in practice. In addition to benefiting offline policy optimization, we show that performing offline primitive learning in this way can also be leveraged for improving few-shot imitation learning as well as exploration and transfer in online RL on a variety of benchmark domains. Visualizations are available at https://sites.google.com/view/opal-iclr
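To make the idea of extracting a continuous space of temporally extended primitives more concrete, the sketch below illustrates one plausible way such primitives could be learned from offline data, assuming a beta-VAE-style objective over fixed-length sub-trajectories: a trajectory encoder maps each sub-trajectory to a continuous latent primitive, and a latent-conditioned low-level policy reconstructs the actions. The module names, architecture choices, dimensions, and hyperparameters here are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of offline primitive extraction (illustrative assumptions only).
import torch
import torch.nn as nn

class TrajectoryEncoder(nn.Module):
    """q(z | tau): encodes a c-step (state, action) sub-trajectory into a latent primitive z."""
    def __init__(self, state_dim, action_dim, latent_dim, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(state_dim + action_dim, hidden, batch_first=True)
        self.mean = nn.Linear(hidden, latent_dim)
        self.log_std = nn.Linear(hidden, latent_dim)

    def forward(self, states, actions):
        # states: (B, c, state_dim), actions: (B, c, action_dim)
        _, h = self.rnn(torch.cat([states, actions], dim=-1))
        h = h.squeeze(0)
        return self.mean(h), self.log_std(h)

class LatentConditionedPolicy(nn.Module):
    """pi(a | s, z): low-level policy that decodes a primitive z into per-step actions."""
    def __init__(self, state_dim, action_dim, latent_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, z):
        return self.net(torch.cat([state, z], dim=-1))

def primitive_loss(encoder, policy, states, actions, beta=0.1):
    """Reconstruct actions given (state, z) while keeping q(z | tau) close to a unit Gaussian."""
    mean, log_std = encoder(states, actions)
    std = log_std.exp()
    z = mean + std * torch.randn_like(std)                 # reparameterization trick
    z_tiled = z.unsqueeze(1).expand(-1, states.shape[1], -1)
    recon = ((policy(states, z_tiled) - actions) ** 2).mean()
    kl = (0.5 * (mean ** 2 + std ** 2 - 2 * log_std - 1)).sum(-1).mean()
    return recon + beta * kl

# Hypothetical usage on a batch of c-step sub-trajectories from the offline dataset:
# enc = TrajectoryEncoder(state_dim=17, action_dim=6, latent_dim=8)
# pol = LatentConditionedPolicy(state_dim=17, action_dim=6, latent_dim=8)
# loss = primitive_loss(enc, pol, states_batch, actions_batch)
```

Under this kind of setup, a downstream high-level policy would act in the continuous latent space (choosing a new z every c steps), which is one way to realize the temporal abstraction and reduced effective horizon discussed above.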