Reinforcement learning (RL) can be used to learn treatment policies and aid decision making in healthcare. However, given the need for generalization over complex state/action spaces, the incorporation of function approximators (e.g., deep neural networks) requires model selection to reduce overfitting and improve policy performance at deployment. Yet a standard validation pipeline for model selection requires running a learned policy in the actual environment, which is often infeasible in a healthcare setting. In this work, we investigate a model selection pipeline for offline RL that relies on off-policy evaluation (OPE) as a proxy for validation performance. We present an in-depth analysis of popular OPE methods, highlighting the additional hyperparameters and computational requirements (fitting/inference of auxiliary models) when used to rank a set of candidate policies. We compare the utility of different OPE methods as part of the model selection pipeline in the context of learning to treat patients with sepsis. Among all the OPE methods we considered, fitted Q evaluation (FQE) consistently leads to the best validation ranking, but at a high computational cost. To balance this trade-off between accuracy of ranking and computational efficiency, we propose a simple two-stage approach to accelerate model selection by avoiding potentially unnecessary computation. Our work serves as a practical guide for offline RL model selection and can help RL practitioners select policies using real-world datasets. To facilitate reproducibility and future extensions, the code accompanying this paper is available online at https://github.com/MLD3/OfflineRL_ModelSelection.
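The two-stage idea described above can be illustrated with a minimal sketch. The abstract does not spell out the stages, so the following is an assumption: a cheap OPE estimator first prunes the candidate set, and an expensive estimator such as FQE is then run only on the surviving shortlist. The function `two_stage_select`, the `keep_fraction` parameter, the policy names, and the toy scores are all hypothetical and are not taken from the paper or its repository.

```python
import math
from typing import Callable, Hashable, List, Sequence, Tuple


def two_stage_select(
    candidates: Sequence[Hashable],
    cheap_score: Callable[[Hashable], float],
    expensive_score: Callable[[Hashable], float],
    keep_fraction: float = 0.25,
) -> Tuple[Hashable, List[Hashable]]:
    """Pick a policy by pruning with a cheap score, then ranking with a costly one.

    Hypothetical helper: `cheap_score` and `expensive_score` stand in for
    OPE estimators (higher is better); neither is a real library call.
    """
    if not candidates:
        raise ValueError("need at least one candidate policy")
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")

    # Stage 1: score every candidate with the cheap estimator and keep the top subset.
    n_keep = max(1, math.ceil(keep_fraction * len(candidates)))
    shortlist = sorted(candidates, key=cheap_score, reverse=True)[:n_keep]

    # Stage 2: run the expensive estimator only on the shortlist.
    best = max(shortlist, key=expensive_score)
    return best, shortlist


# Toy usage with made-up scores (not results from the paper).
cheap_values = {"pi_a": 0.2, "pi_b": 0.9, "pi_c": 0.7, "pi_d": 0.1}
expensive_values = {"pi_b": 0.6, "pi_c": 0.8}
expensive_calls: List[str] = []


def costly_estimate(name: str) -> float:
    # Record each call to show that pruned candidates are never evaluated.
    expensive_calls.append(name)
    return expensive_values[name]


best, shortlist = two_stage_select(
    list(cheap_values), cheap_values.get, costly_estimate, keep_fraction=0.5
)
```

In this toy run the expensive estimator is called for only two of the four candidates, which is where the computational saving would come from. The trade-off is that a good policy ranked poorly by the cheap estimator is discarded in the first stage, so `keep_fraction` controls how much ranking accuracy is risked for speed.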