Randomized controlled trials (RCTs) are the gold standard for evaluating the causal effect of a treatment; however, they often have limited sample sizes and sometimes poor generalizability. On the other hand, non-randomized, observational data derived from large administrative databases have massive sample sizes and better generalizability, but they are prone to unmeasured confounding bias. It is thus of considerable interest to reconcile effect estimates obtained from randomized controlled trials and observational studies investigating the same intervention, potentially harvesting the best from both realms. In this paper, we theoretically characterize the potential efficiency gain of integrating observational data into RCT-based analysis from a minimax point of view. For estimation, we derive the minimax rate of convergence for the mean squared error, and propose a fully adaptive anchored thresholding estimator that attains the optimal rate up to poly-log factors. For inference, we characterize the minimax rate for the length of confidence intervals and show that adaptation (to unknown confounding bias) is in general impossible. A curious phenomenon thus emerges: for estimation, the efficiency gain from data integration can be achieved without prior knowledge of the magnitude of the confounding bias; for inference, the same task becomes information-theoretically impossible in general. We corroborate our theoretical findings using simulations and a real data example from the RCT DUPLICATE initiative [Franklin et al., 2021b].
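To make the idea of thresholding an estimated confounding bias concrete, the following is a minimal sketch and not necessarily the estimator constructed in the paper. It assumes scalar effect estimates with known standard errors from each source, and a threshold of order sqrt(log n) times the standard error of the bias estimate. The function names, the constant `c`, and the inverse-variance weighting are illustrative assumptions.

```python
import math


def soft_threshold(x, lam):
    """Shrink x toward zero by lam; return zero when |x| <= lam."""
    return math.copysign(max(abs(x) - lam, 0.0), x)


def anchored_thresholding(tau_rct, se_rct, tau_obs, se_obs, n, c=1.0):
    """Hypothetical sketch: combine an RCT estimate with an observational one.

    The RCT estimate serves as the unbiased anchor. The observational
    estimate is corrected by a soft-thresholded estimate of its bias and
    then pooled with the RCT estimate by inverse-variance weighting.
    """
    # Estimated confounding bias of the observational estimate.
    delta_hat = tau_obs - tau_rct
    se_delta = math.sqrt(se_rct ** 2 + se_obs ** 2)

    # Threshold level (assumed form): c * sqrt(log n) * se of delta_hat.
    lam = c * math.sqrt(math.log(n)) * se_delta

    # Small estimated bias is treated as zero; large bias is mostly removed.
    delta_thr = soft_threshold(delta_hat, lam)
    tau_obs_corrected = tau_obs - delta_thr

    # Inverse-variance weight placed on the RCT estimate.
    w_rct = se_obs ** 2 / (se_rct ** 2 + se_obs ** 2)
    return w_rct * tau_rct + (1.0 - w_rct) * tau_obs_corrected


if __name__ == "__main__":
    # Small apparent bias: the result is close to the pooled estimate.
    print(anchored_thresholding(1.0, 0.5, 1.1, 0.1, n=100))
    # Large apparent bias: the result stays within the threshold of the RCT.
    print(anchored_thresholding(1.0, 0.5, 5.0, 0.1, n=100))
```

When the estimated bias falls below the threshold, the sketch reduces to the usual inverse-variance pooled estimate and so borrows the precision of the observational data. When the estimated bias is large, the corrected observational estimate is pulled back to within the threshold of the RCT anchor, so the price of not knowing the bias is at most a logarithmic factor in this simplified setting.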