Reinforcement learning is a general technique that allows an agent to learn an optimal policy through interaction with an environment in sequential decision making problems. The goodness of a policy is measured by its value function starting from some initial state. The focus of this paper is to construct confidence intervals (CIs) for a policy's value in infinite-horizon settings, where the number of decision points diverges to infinity. We propose to model the state-action value function (Q-function) associated with a policy using series/sieve methods to derive its confidence interval. When the target policy depends on the observed data as well, we propose a SequentiAl Value Evaluation (SAVE) method to recursively update the estimated policy and its value estimator. As long as either the number of trajectories or the number of decision points diverges to infinity, we show that the proposed CI achieves nominal coverage even in cases where the optimal policy is not unique. Simulation studies are conducted to back up our theoretical findings. We apply the proposed method to a dataset from mobile health studies and find that reinforcement learning algorithms could help improve patients' health status. A Python implementation of the proposed procedure is available at https://github.com/shengzhang37/SAVE.
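To make the sieve-based construction concrete, the following is a minimal sketch of estimating a Q-function with a polynomial basis and forming a Wald-type CI for a fixed policy's value. It is not the authors' implementation (see the SAVE repository for that); the scalar state, binary action, helper names `phi` and `value_ci`, and the LSTD-type estimating equation are illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation) of a sieve-based
# Q-function estimate and a Wald-type CI for a policy's value.
# Assumptions: a fixed target policy `pi`, a polynomial basis `phi`,
# discount factor `gamma`, and transition tuples (s, a, r, s_next)
# pooled across trajectories.
import numpy as np

def phi(s, a, degree=3):
    """Polynomial sieve basis for a scalar state and binary action."""
    feats = np.array([s**k for k in range(degree + 1)])
    # Separate basis coefficients for each action (a in {0, 1}).
    return np.concatenate([feats * (a == 0), feats * (a == 1)])

def value_ci(data, pi, s0, gamma=0.9, degree=3, alpha=0.05):
    """Point estimate and (1 - alpha) CI for the value of policy `pi` at state s0."""
    # Basis evaluations at observed pairs and at the next state under pi.
    Phi = np.array([phi(s, a, degree) for s, a, r, s1 in data])
    Phi_next = np.array([phi(s1, pi(s1), degree) for s, a, r, s1 in data])
    R = np.array([r for s, a, r, s1 in data])
    n, p = Phi.shape
    # LSTD-type estimating equation: Phi^T (Phi - gamma * Phi_next) beta = Phi^T R.
    A = Phi.T @ (Phi - gamma * Phi_next) / n
    b = Phi.T @ R / n
    beta = np.linalg.solve(A, b)
    # Plug-in sandwich variance for the value estimate phi(s0, pi(s0))^T beta,
    # built from the temporal-difference residuals.
    resid = R + gamma * (Phi_next @ beta) - Phi @ beta
    Omega = (Phi * resid[:, None]).T @ (Phi * resid[:, None]) / n
    A_inv = np.linalg.inv(A)
    cov_beta = A_inv @ Omega @ A_inv.T / n
    u = phi(s0, pi(s0), degree)
    est = u @ beta
    se = np.sqrt(u @ cov_beta @ u)
    z = 1.959963984540054  # standard normal 97.5% quantile for a 95% CI
    return est, (est - z * se, est + z * se)
```

For a data-dependent target policy, the paper's SAVE procedure would instead re-estimate the policy and update the value estimator recursively as new data blocks arrive; the sketch above only covers evaluation of a fixed policy.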