Deconfounding Actor-Critic Network with Policy Adaptation for Dynamic Treatment Regimes

Despite intense efforts in basic and clinical research, an individualized ventilation strategy for critically ill patients remains a major challenge. Recently, dynamic treatment regimes (DTRs) learned with reinforcement learning (RL) from electronic health records (EHRs) have attracted interest from both the healthcare industry and the machine learning research community. However, most learned DTR policies may be biased by confounders. Some treatment actions that non-survivors received may have been helpful, but if confounders cause the mortality, training RL models on long-term outcomes (e.g., 90-day mortality) would punish those actions, causing the learned DTR policies to be suboptimal. In this study, we develop a new deconfounding actor-critic network (DAC) to learn optimal DTR policies for patients. To alleviate confounding issues, we incorporate a patient resampling module and a confounding balance module into our actor-critic framework. To avoid punishing the effective treatment actions that non-survivors received, we design a short-term reward that captures patients' immediate health state changes. Combining short-term with long-term rewards could further improve model performance. Moreover, we introduce a policy adaptation method to transfer the learned model to small-scale datasets from new sources. Experimental results on one semi-synthetic and two real-world datasets show that the proposed model outperforms state-of-the-art models. The proposed model provides individualized treatment decisions for mechanical ventilation that could improve patient outcomes.
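
The idea of combining a short-term reward with a long-term reward can be illustrated with a minimal sketch. This is not the DAC reward definition, which the abstract does not specify. The severity score, the weight `alpha`, the terminal magnitude, and the function name `combined_rewards` are all hypothetical choices made for illustration.

```python
from typing import List


def combined_rewards(
    severity: List[float],
    died: bool,
    alpha: float = 0.5,      # assumed weight on the short-term term
    terminal: float = 1.0,   # assumed magnitude of the long-term reward
) -> List[float]:
    """Per-step rewards for one patient trajectory (illustrative only).

    severity: hypothetical severity scores at steps 0..T (higher = sicker).
    died:     long-term outcome, e.g. 90-day mortality.
    """
    if len(severity) < 2:
        raise ValueError("need at least two severity observations")

    # Short-term reward: drop in severity between consecutive steps,
    # so an action followed by improvement is rewarded immediately.
    short = [severity[t] - severity[t + 1] for t in range(len(severity) - 1)]

    # Long-term reward: assigned only at the final step of the trajectory.
    long_term = [0.0] * len(short)
    long_term[-1] = -terminal if died else terminal

    # Weighted combination of the two signals.
    return [alpha * s + (1.0 - alpha) * l for s, l in zip(short, long_term)]


# A non-survivor whose first action was followed by improvement:
rewards = combined_rewards([4.0, 2.0, 3.0], died=True)
print(rewards)
```

In this toy trajectory the first action still receives a positive reward even though the patient died, which is the behavior the short-term reward is meant to provide under these assumptions.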
