We consider offline reinforcement learning (RL) methods in possibly nonstationary environments. Many existing RL algorithms in the literature rely on the stationarity assumption, which requires the system transition and the reward function to be constant over time. However, this assumption is restrictive in practice and is likely to be violated in a number of applications, including traffic signal control, robotics, and mobile health. In this paper, we develop a consistent procedure to test the nonstationarity of the optimal policy based on pre-collected historical data, without requiring additional online data collection. Based on the proposed test, we further develop a sequential change point detection method that can be naturally coupled with existing state-of-the-art RL methods for policy optimization in nonstationary environments. The usefulness of our method is illustrated by theoretical results, simulation studies, and a real data example from the 2018 Intern Health Study. A Python implementation of the proposed procedure is available at https://github.com/limengbinggz/CUSUM-RL.
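To make the overall pipeline concrete, the following is a minimal sketch of the generic idea of scanning candidate change points with a CUSUM-type statistic and then restricting policy learning to the post-change data. It is an illustrative simplification, not the paper's actual test: the statistic here detects a mean shift in a scalar reward summary rather than a change in the optimal policy, the permutation threshold assumes exchangeable observations, and all function names are hypothetical.

```python
import numpy as np

def cusum_statistic(x):
    """Max-over-candidates CUSUM statistic for a mean shift in a 1-D series.

    x is assumed to be a (T,) array of per-time-point summaries (e.g., average
    rewards pooled across offline trajectories); this scalar reduction is an
    illustrative stand-in for the paper's policy-based test statistic.
    """
    T = len(x)
    stats = []
    for u in range(1, T):  # candidate change point after index u - 1
        left, right = x[:u], x[u:]
        scale = np.sqrt(u * (T - u) / T)  # standard CUSUM weighting
        stats.append(scale * abs(left.mean() - right.mean()))
    return max(stats), int(np.argmax(stats)) + 1

def permutation_threshold(x, level=0.05, n_perm=500, seed=0):
    """Approximate rejection threshold by permuting time indices.

    Valid only under an exchangeability assumption; used purely to illustrate
    how a data-driven critical value can be obtained.
    """
    rng = np.random.default_rng(seed)
    null_stats = [cusum_statistic(rng.permutation(x))[0] for _ in range(n_perm)]
    return float(np.quantile(null_stats, 1 - level))

# Toy usage: the mean of the reward process shifts at t = 60.
rng = np.random.default_rng(1)
rewards = np.concatenate([rng.normal(0.0, 1.0, 60), rng.normal(1.0, 1.0, 40)])
stat, tau = cusum_statistic(rewards)
if stat > permutation_threshold(rewards):
    print(f"nonstationarity detected; fit the policy on data after t = {tau}")
else:
    print("no change detected; use the full offline dataset")
```

In the sequential use case, a scan of this kind would be repeated as new batches of offline data arrive, and the most recent detected change point determines which segment of the history is passed to the downstream RL algorithm.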