We consider reinforcement learning (RL) methods in offline domains without additional online data collection, such as mobile health applications. Most existing policy optimization algorithms in the computer science literature are developed in online settings where data are easy to collect or simulate. Their generalizations to mobile health applications with a pre-collected offline dataset remain largely unexplored. The aim of this paper is to develop a novel advantage learning framework that efficiently uses pre-collected data for policy optimization. The proposed method takes as input an optimal Q-estimator computed by any existing state-of-the-art RL algorithm and outputs a new policy whose value is guaranteed to converge at a faster rate than that of the policy derived from the initial Q-estimator. Extensive numerical experiments are conducted to support our theoretical findings. A Python implementation of our proposed method is available at https://github.com/leyuanheart/SEAL.
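To make the input/output interface concrete, the following is a minimal sketch, not the SEAL implementation itself: it only illustrates how a baseline greedy policy can be extracted from a fitted Q-estimator, i.e., the policy that the proposed advantage-learning step would then refine. The names `greedy_policy`, `q_estimator`, and the toy Q-function are hypothetical and not taken from the paper or repository.

```python
import numpy as np

# Hypothetical illustration (function and variable names are not from the paper):
# given a fitted Q-estimator, form the greedy policy it induces. This is the
# baseline policy whose value the advantage-learning step aims to improve upon.

def greedy_policy(q_estimator, n_actions):
    """Return a policy that selects the action maximizing the estimated Q-value."""
    def policy(state):
        q_values = [q_estimator(state, a) for a in range(n_actions)]
        return int(np.argmax(q_values))
    return policy

# Toy Q-estimator over a one-dimensional state and two actions, for illustration only.
toy_q = lambda s, a: -(s - a) ** 2
pi = greedy_policy(toy_q, n_actions=2)
print(pi(0.8))  # action 1 maximizes the toy Q-value at state 0.8
```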