When to Trust Your Simulator: Dynamics-Aware Hybrid Offline-and-Online Reinforcement Learning

Learning effective reinforcement learning (RL) policies to solve complex real-world tasks can be quite challenging without a high-fidelity simulation environment. In most cases, we are only given imperfect simulators with simplified dynamics, which inevitably lead to severe sim-to-real gaps in RL policy learning. The recently emerged field of offline RL provides another possibility: learning policies directly from pre-collected historical data. However, to achieve reasonable performance, existing offline RL algorithms need impractically large offline datasets with sufficient state-action space coverage for training. This raises a new question: is it possible to combine learning from limited real data in offline RL with unrestricted exploration through imperfect simulators in online RL to address the drawbacks of both approaches? In this study, we propose the Dynamics-Aware Hybrid Offline-and-Online Reinforcement Learning (H2O) framework to provide an affirmative answer to this question. H2O introduces a dynamics-aware policy evaluation scheme, which adaptively penalizes Q-function learning on simulated state-action pairs with large dynamics gaps, while simultaneously allowing learning from a fixed real-world dataset. Through extensive simulation and real-world tasks, as well as theoretical analysis, we demonstrate the superior performance of H2O against other cross-domain online and offline RL algorithms. H2O provides a new hybrid offline-and-online RL paradigm, which can potentially shed light on future RL algorithm design for solving practical real-world tasks.
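
To make the idea of a dynamics-aware penalty concrete, the following is a minimal sketch and not the paper's actual objective. It assumes that each simulated state-action pair already has a scalar dynamics-gap estimate, turns those estimates into softmax weights, pushes Q-values down on heavily weighted simulated pairs and up on real pairs, and adds an ordinary squared Bellman error over both data sources. The function name, argument names, softmax weighting, and the `beta` and `temperature` parameters are all hypothetical choices made for illustration.

```python
import numpy as np


def dynamics_aware_q_loss(q_sim, gap_sim, td_sim, q_real, td_real,
                          beta=1.0, temperature=1.0):
    """Toy hybrid loss (illustrative assumption, not the H2O objective).

    q_sim   : Q-values on simulated state-action pairs
    gap_sim : estimated dynamics gap for each simulated pair
    td_sim  : Bellman (TD) errors on simulated pairs
    q_real  : Q-values on pairs from the fixed real-world dataset
    td_real : Bellman (TD) errors on real pairs
    """
    q_sim = np.asarray(q_sim, dtype=float)
    gap_sim = np.asarray(gap_sim, dtype=float)
    td_sim = np.asarray(td_sim, dtype=float)
    q_real = np.asarray(q_real, dtype=float)
    td_real = np.asarray(td_real, dtype=float)

    # Softmax over gap estimates: a larger gap gives a larger weight.
    # Subtracting the max keeps the exponentials numerically stable.
    z = gap_sim / temperature
    z = z - z.max()
    weights = np.exp(z) / np.exp(z).sum()

    # Conservative term: lower Q where the simulator is least trusted,
    # raise Q on state-action pairs seen in real data.
    penalty = np.sum(weights * q_sim) - np.mean(q_real)

    # Standard squared Bellman error over both data sources.
    bellman = 0.5 * np.mean(np.concatenate([td_sim, td_real]) ** 2)

    return beta * penalty + bellman, weights


loss, weights = dynamics_aware_q_loss(
    q_sim=[1.0, 2.0], gap_sim=[0.0, 10.0], td_sim=[0.0, 0.0],
    q_real=[1.5], td_real=[0.0],
)
```

In this sketch, a simulated pair whose estimated gap is large dominates the penalty, so its Q-value is discouraged from being overestimated, while pairs where the simulator matches the real dynamics are penalized only lightly.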
