Deep Reinforcement Learning (DRL) has demonstrated great potential in solving sequential decision-making problems across many applications. Despite its promising performance, practical gaps remain when deploying DRL in real-world scenarios. One main barrier is overfitting, which leads to poor generalizability of the policy learned by DRL. In particular, for offline DRL with observational data, model selection is a challenging task because no ground truth is available for performance evaluation, in contrast with the online setting with simulated environments. In this work, we propose a pessimistic model selection (PMS) approach for offline DRL with a theoretical guarantee, which provides a provably effective framework for finding the best policy among a set of candidate models. We also propose two refined approaches to address the potential bias of DRL models in identifying the optimal policy. Numerical studies demonstrate the superior performance of our approach over existing methods.
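To make the core idea concrete, a minimal sketch of pessimistic selection is shown below. It assumes (this is an illustrative assumption, not the paper's exact procedure) that each candidate policy comes with an off-policy value estimate and an uncertainty measure, and that PMS-style selection picks the candidate maximizing a lower confidence bound rather than the raw point estimate; the function name and the normal-approximation bound are hypothetical choices for illustration.

```python
import math

def pessimistic_select(value_estimates, std_errors, alpha=0.05):
    """Pick the candidate policy with the largest lower confidence bound.

    value_estimates : off-policy value estimate for each candidate policy
    std_errors      : standard error of each estimate (uncertainty proxy)
    alpha           : one-sided miscoverage level for the bound

    Illustrative sketch: uses a normal-approximation one-sided LCB,
    value - z_{1-alpha} * stderr, and returns the argmax index.
    """
    # One-sided normal quantile z_{1-alpha} via the inverse error function.
    z = math.sqrt(2.0) * _erfinv(1.0 - 2.0 * alpha)
    lcbs = [v - z * s for v, s in zip(value_estimates, std_errors)]
    return max(range(len(lcbs)), key=lambda i: lcbs[i])

def _erfinv(x):
    # Winitzki's approximation of the inverse error function
    # (adequate for illustrative quantile computation).
    a = 0.147
    ln_term = math.log(1.0 - x * x)
    first = 2.0 / (math.pi * a) + ln_term / 2.0
    return math.copysign(
        math.sqrt(math.sqrt(first * first - ln_term / a) - first), x
    )

# A candidate with a high point estimate but large uncertainty is
# penalized, so the more reliable policy is selected.
best = pessimistic_select([1.0, 1.2], [0.05, 0.60])
```

The design choice reflects the pessimism principle in offline RL: because estimates from observational data cannot be validated online, selecting by a lower bound guards against favoring a policy whose apparent value is inflated by estimation error.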