Reinforcement learning (RL) has shown great success in estimating sequential treatment rules (STRs), which account for patient heterogeneity. However, health-outcome information is often not well coded but rather embedded in clinical notes, and extracting precise outcome information is a resource-intensive task. As a result, only small, well-annotated cohorts are typically available. We propose a semi-supervised learning (SSL) approach that efficiently leverages a small labeled dataset $\mathcal{L}$, in which the true outcome is observed, together with a large unlabeled dataset $\mathcal{U}$ containing outcome surrogates $\pmb W$. In particular, we propose a theoretically justified SSL approach to $Q$-learning and develop a robust and efficient SSL estimator of the value function of the derived optimal STR, defined as the expected counterfactual outcome under the optimal STR. Generalizing SSL to learning STRs raises interesting challenges. First, the feature distribution for predicting $Y_t$ is unknown in the $Q$-learning procedure, since it involves the unknown $Y_{t-1}$ due to the sequential nature of the problem. Our methods for estimating the optimal STR and its associated value function carefully adapt to this sequentially missing data structure. Second, we modify the SSL framework to handle surrogate variables $\pmb W$ that are predictive of the outcome through the joint law $\mathbb{P}_{Y,\pmb O,\pmb W}$ but are not part of the conditional distribution of interest $\mathbb{P}_{Y|\pmb O}$. We provide theoretical results characterizing when and to what degree efficiency can be gained from $\pmb W$ and $\pmb O$. Our approach is robust to misspecification of the imputation models. Further, we provide a doubly robust value function estimator for the derived STR: if either the $Q$ functions or the propensity score functions are correctly specified, our value function estimators are consistent for the true value function.
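For concreteness, the double robustness claim can be illustrated with the familiar single-stage augmented inverse probability weighting (AIPW) form of a value estimator; this is a generic sketch rather than the paper's exact multi-stage construction, with $\hat\pi$ a fitted propensity score, $\hat Q$ a fitted $Q$ function, and $\hat d$ the estimated rule:
\[
\hat{\mathcal{V}}(\hat d) \;=\; \frac{1}{n}\sum_{i=1}^{n}\left[\hat{Q}\bigl\{\pmb O_i,\hat d(\pmb O_i)\bigr\} \;+\; \frac{\mathbb{1}\bigl\{A_i=\hat d(\pmb O_i)\bigr\}}{\hat{\pi}(A_i\mid \pmb O_i)}\bigl\{Y_i-\hat{Q}(\pmb O_i,A_i)\bigr\}\right].
\]
The first term is consistent when $\hat Q$ is correctly specified, while the augmentation term corrects its bias whenever $\hat\pi$ is correct, so the estimator is consistent if either working model holds.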
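Similarly, the SSL imputation step at the heart of the proposal can be sketched in a single-stage setting. The snippet below uses a toy simulated dataset and plain linear working models; all variable names and model choices are illustrative assumptions, not the paper's estimators:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_lab, n_unlab, p = 200, 2000, 5

# Small labeled set L: covariates O, binary treatment A, surrogates W, outcome Y
O_lab = rng.normal(size=(n_lab, p))
A_lab = rng.integers(0, 2, size=n_lab)
W_lab = O_lab[:, :2] + rng.normal(scale=0.5, size=(n_lab, 2))
Y_lab = O_lab[:, 0] + A_lab * O_lab[:, 1] + rng.normal(size=n_lab)

# Large unlabeled set U: same structure, but Y is unobserved
O_un = rng.normal(size=(n_unlab, p))
A_un = rng.integers(0, 2, size=n_unlab)
W_un = O_un[:, :2] + rng.normal(scale=0.5, size=(n_unlab, 2))

def imp_design(O, A, W):
    # Imputation features may use the surrogates W and treatment interactions
    return np.column_stack([O, A, W, A[:, None] * O])

def q_design(O, A):
    # Q-function features exclude W: W predicts Y through the joint law
    # but is not part of the conditional model of interest
    return np.column_stack([O, A, A[:, None] * O])

# Step 1: fit the imputation model on L, impute pseudo-outcomes on U
imputer = LinearRegression().fit(imp_design(O_lab, A_lab, W_lab), Y_lab)
Y_imp = imputer.predict(imp_design(O_un, A_un, W_un))

# Step 2: fit the Q function on labeled outcomes plus imputed pseudo-outcomes
X_all = q_design(np.vstack([O_lab, O_un]), np.concatenate([A_lab, A_un]))
y_all = np.concatenate([Y_lab, Y_imp])
q_model = LinearRegression().fit(X_all, y_all)

# Estimated rule: treat when the fitted Q favors A = 1
def d_hat(O):
    q0 = q_model.predict(q_design(O, np.zeros(len(O), dtype=int)))
    q1 = q_model.predict(q_design(O, np.ones(len(O), dtype=int)))
    return (q1 > q0).astype(int)

print(d_hat(O_un[:10]))
```

In the sequential setting the same two steps would be applied backward in time, with the stage-$t$ pseudo-outcome built from the imputed $Y_t$ and the fitted stage-$(t+1)$ $Q$ function, which is where the sequentially missing data structure described above enters.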