Ambiguous Dynamic Treatment Regimes: A Reinforcement Learning Approach

A main research goal in various studies is to use an observational data set and provide a new set of counterfactual guidelines that can yield causal improvements. Dynamic Treatment Regimes (DTRs) are widely studied to formalize this process. However, available methods for finding optimal DTRs often rely on assumptions that are violated in real-world applications (e.g., medical decision-making or public policy), especially when (a) the existence of unobserved confounders cannot be ignored, and (b) the unobserved confounders are time-varying (e.g., affected by previous actions). When such assumptions are violated, one often faces ambiguity regarding the underlying causal model that must be assumed to obtain an optimal DTR. This ambiguity is inevitable, since the dynamics of unobserved confounders and their causal impact on the observed part of the data cannot be understood from the observed data. Motivated by a case study of finding superior treatment regimes for patients who underwent transplantation in our partner hospital and faced a medical condition known as New Onset Diabetes After Transplantation (NODAT), we extend DTRs to a new class termed Ambiguous Dynamic Treatment Regimes (ADTRs), in which the causal impact of treatment regimes is evaluated based on a "cloud" of potential causal models. We then connect ADTRs to Ambiguous Partially Observable Markov Decision Processes (APOMDPs) proposed by Saghafian (2018), and develop two Reinforcement Learning methods termed Direct Augmented V-Learning (DAV-Learning) and Safe Augmented V-Learning (SAV-Learning), which enable using the observed data to efficiently learn an optimal treatment regime. We establish theoretical results for these learning methods, including (weak) consistency and asymptotic normality. We further evaluate the performance of these learning methods both in our case study and in simulation experiments.
