Confounding remains one of the major challenges to causal inference with observational data. This problem is paramount in medicine, where we would like to answer causal questions from large observational datasets like electronic health records (EHRs). Modern medical data (such as EHRs) typically contain tens of thousands of covariates. Such a large set carries hope that many of the confounders are directly measured, and further hope that others are indirectly measured through their correlation with measured covariates. How can we exploit these large sets of covariates for causal inference? To help answer this question, this paper examines the performance of the large-scale propensity score (LSPS) approach on causal analysis of medical data. We demonstrate that LSPS may adjust for indirectly measured confounders by including tens of thousands of covariates that may be correlated with them. We present conditions under which LSPS removes bias due to indirectly measured confounders, and we show that LSPS may avoid bias when inadvertently adjusting for variables (like colliders) that otherwise can induce bias. We demonstrate the performance of LSPS with both simulated medical data and real medical data.