Deep Offline Reinforcement Learning for Real-World Treatment Optimization Applications

There is increasing interest in data-driven approaches for dynamically choosing optimal treatment strategies in many chronic disease management and critical care applications. Reinforcement learning (RL) methods are well-suited to this sequential decision-making problem, but must be trained and evaluated exclusively on retrospective medical record datasets, as direct online exploration is unsafe and infeasible. Despite this requirement, the vast majority of dynamic treatment optimization studies use off-policy RL methods (e.g., Double Deep Q-Network (DDQN) or its variants) that are known to perform poorly in purely offline settings. Recent advances in offline RL, such as Conservative Q-Learning (CQL), offer a suitable alternative. However, challenges remain in adapting these approaches to real-world applications where suboptimal examples dominate the retrospective dataset and strict safety constraints must be satisfied. In this work, we introduce a practical transition sampling approach to address action imbalance during offline RL training, and an intuitive heuristic to enforce hard constraints during policy execution. We provide theoretical analyses to show that our proposed approach would improve over CQL. We perform extensive experiments on two real-world tasks, diabetes and sepsis treatment optimization, to compare the performance of the proposed approach against prominent off-policy and offline RL baselines (DDQN and CQL). Across a range of principled and clinically relevant metrics, we show that our proposed approach enables substantial improvements in expected health outcomes and in consistency with relevant practice and safety guidelines.
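As a rough illustration of the two ideas named in the abstract, the sketch below shows (a) sampling probabilities that up-weight transitions whose actions are rare in the dataset and (b) greedy action selection restricted to an allowed set at execution time. The abstract does not describe the actual sampling scheme or the constraint heuristic, so the inverse-frequency weighting and both function names (`action_balanced_weights`, `constrained_greedy_action`) are assumptions made for illustration, not the authors' method.

```python
import random
from collections import Counter
from typing import List, Sequence, Set


def action_balanced_weights(actions: Sequence[int]) -> List[float]:
    """Per-transition sampling probabilities proportional to
    1 / (count of that transition's action). Hypothetical scheme:
    every action class ends up with the same total probability mass."""
    if not actions:
        raise ValueError("actions must be non-empty")
    counts = Counter(actions)
    raw = [1.0 / counts[a] for a in actions]
    total = sum(raw)
    return [w / total for w in raw]


def constrained_greedy_action(q_values: Sequence[float], allowed: Set[int]) -> int:
    """Return the highest-Q action among those permitted by the
    (hypothetical) safety rules; disallowed actions are never chosen."""
    candidates = [a for a in range(len(q_values)) if a in allowed]
    if not candidates:
        raise ValueError("no allowed action for this state")
    return max(candidates, key=lambda a: q_values[a])


# Toy dataset: action 0 dominates, action 1 is rare.
logged_actions = [0, 0, 0, 1]
weights = action_balanced_weights(logged_actions)

# Draw a minibatch of transition indices using the balanced weights.
batch_indices = random.choices(range(len(logged_actions)), weights=weights, k=2)

# At execution time, action 1 has the highest Q-value but is disallowed.
chosen = constrained_greedy_action([1.0, 5.0, 3.0], allowed={0, 2})
```

In this toy setup the single transition with the rare action is sampled as often as the three common-action transitions combined, and the executed action is the best one inside the allowed set rather than the unconstrained argmax.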
