A notable challenge in leveraging electronic health records (EHR) for treatment effect assessment is the lack of precise information on important clinical variables, including the treatment received and the response. Neither variable can usually be captured accurately by readily available EHR features; precise annotation requires labor-intensive manual chart review, which limits the number of gold-standard labels available for these key variables. We consider average treatment effect (ATE) estimation in this semi-supervised setting, in which a large number of unlabeled samples contain both confounders and imperfect EHR features for treatment and response. We derive the efficient influence function for the ATE and use it to construct a semi-supervised multiple machine learning (SMMAL) estimator. We show that the SMMAL estimator is semiparametrically efficient with B-spline regression under low-dimensional smooth models. We also develop an adaptive sparsity/model doubly robust estimator under high-dimensional logistic propensity score and outcome regression models. Simulation studies support the validity of the SMMAL method and its superiority over supervised benchmarks.