Model Generation with Provable Coverability for Offline Reinforcement Learning

Model-based offline optimization with a dynamics-aware policy offers a new perspective on policy learning and out-of-distribution generalization: the learned policy can adapt to the different dynamics enumerated during training. In the offline setting, however, the learned model cannot mimic the real dynamics well enough to support reliable out-of-distribution exploration, which still prevents the policy from generalizing well. To narrow this gap, previous works simply ensemble randomly initialized models to better approximate the real dynamics. Such practice is costly and inefficient, and it provides no guarantee on how well the learned models approximate the real dynamics, a property we call coverability in this paper. We address this issue by generating models with a provable ability to cover the real dynamics in an efficient and controllable way. To that end, we design a distance metric for dynamics models based on the occupancy of policies under the dynamics, and propose an algorithm that generates models optimized for their coverage of the real dynamics. We give a theoretical analysis of the model generation process and prove that our algorithm provides enhanced coverability. As a downstream task, we train a dynamics-aware policy with a minor or no conservative penalty, and experiments demonstrate that our algorithm outperforms prior offline methods on existing offline RL benchmarks. We also find that policies learned by our method achieve better zero-shot transfer performance, implying better generalization.
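To make the idea of an occupancy-based distance between dynamics models concrete, here is a minimal tabular sketch. It is not the metric defined in the paper, whose exact form the abstract does not give. It assumes a finite MDP, measures the total-variation gap between discounted state-action occupancies, and takes the maximum over a small, user-supplied set of policies. The names `occupancy` and `dynamics_distance` are hypothetical.

```python
import numpy as np

def occupancy(P, pi, mu0, gamma=0.9):
    """Discounted state-action occupancy of policy `pi` under dynamics `P`.

    P:   (S, A, S) array, P[s, a, t] = probability of moving from s to t under a.
    pi:  (S, A) array, pi[s, a] = probability of taking a in s.
    mu0: (S,) initial state distribution.
    """
    S = P.shape[0]
    # State-to-state transition matrix induced by pi.
    P_pi = np.einsum("sa,sat->st", pi, P)
    # Solve d = (1 - gamma) * mu0 + gamma * P_pi^T d for the state occupancy d.
    d_state = np.linalg.solve(np.eye(S) - gamma * P_pi.T, (1 - gamma) * mu0)
    # Spread the state occupancy over actions according to pi.
    return d_state[:, None] * pi

def dynamics_distance(P1, P2, policies, mu0, gamma=0.9):
    """Illustrative distance (an assumption, not the paper's definition):
    the largest total-variation gap between the occupancies that the same
    policy induces under the two dynamics models."""
    return max(
        0.5 * np.abs(occupancy(P1, pi, mu0, gamma) - occupancy(P2, pi, mu0, gamma)).sum()
        for pi in policies
    )
```

Under this sketch, two models are close when every policy in the set visits nearly the same state-action pairs under both, which matches the intuition that a generated model "covers" the real dynamics if policies behave similarly in it.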
