Given only data generated by a standard confounding graph with an unobserved confounder, the Average Treatment Effect (ATE) is not identifiable. To estimate the ATE, a practitioner must then either (a) collect deconfounded data; (b) run a clinical trial; or (c) elucidate further properties of the causal graph that might render the ATE identifiable. In this paper, we consider the benefit of incorporating a large confounded observational dataset (confounder unobserved) alongside a small deconfounded observational dataset (confounder revealed) when estimating the ATE. Our theoretical results suggest that the inclusion of confounded data can significantly reduce the quantity of deconfounded data required to estimate the ATE to within a desired accuracy level. Moreover, in some cases (say, genetics) we could imagine retrospectively selecting samples to deconfound. We demonstrate that by actively selecting these samples based upon the (already observed) treatment and outcome, we can reduce sample complexity further. Our theoretical and empirical results establish that the worst-case relative performance of our approach (vs. a natural benchmark) is bounded while our best-case gains are unbounded. Finally, we demonstrate the benefits of selective deconfounding using a large real-world dataset related to genetic mutation in cancer.
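To make the identifiability point concrete, the following is a minimal sketch, not the paper's estimator. It assumes a binary confounder Z, treatment T, and outcome Y, with hypothetical parameter values chosen only for illustration. It works with exact population probabilities rather than samples. It shows that the naive contrast computed from P(Y, T) alone misses the ATE, while combining P(Y, T) (what confounded data can reveal) with P(Z | Y, T) (what deconfounded data can additionally reveal) restores the full joint and hence the backdoor-adjusted ATE. This particular factorization is an assumption of the sketch.

```python
from itertools import product

# Hypothetical ground-truth parameters (illustrative only, not from the paper).
P_Z1 = 0.3                                   # P(Z = 1)
P_T1_GIVEN_Z = {0: 0.2, 1: 0.8}              # P(T = 1 | Z = z)
P_Y1_GIVEN_TZ = {(0, 0): 0.1, (1, 0): 0.3,   # P(Y = 1 | T = t, Z = z), keyed by (t, z)
                 (0, 1): 0.5, (1, 1): 0.7}


def bern(p, value):
    """Probability that a Bernoulli(p) variable takes the given 0/1 value."""
    return p if value == 1 else 1.0 - p


def full_joint():
    """Exact joint P(Y, T, Z) implied by the confounding graph Z -> T, Z -> Y, T -> Y."""
    return {
        (y, t, z): bern(P_Z1, z)
        * bern(P_T1_GIVEN_Z[z], t)
        * bern(P_Y1_GIVEN_TZ[(t, z)], y)
        for y, t, z in product((0, 1), repeat=3)
    }


def ate_from_joint(p):
    """Backdoor adjustment: sum_z P(z) * [P(Y=1 | T=1, z) - P(Y=1 | T=0, z)]."""
    ate = 0.0
    for z in (0, 1):
        p_z = sum(p[(y, t, z)] for y in (0, 1) for t in (0, 1))
        p_y1_t1 = p[(1, 1, z)] / (p[(0, 1, z)] + p[(1, 1, z)])
        p_y1_t0 = p[(1, 0, z)] / (p[(0, 0, z)] + p[(1, 0, z)])
        ate += p_z * (p_y1_t1 - p_y1_t0)
    return ate


def naive_difference(p_yt):
    """Unadjusted contrast P(Y=1 | T=1) - P(Y=1 | T=0), computed from P(Y, T) only."""
    p_y1_t1 = p_yt[(1, 1)] / (p_yt[(0, 1)] + p_yt[(1, 1)])
    p_y1_t0 = p_yt[(1, 0)] / (p_yt[(0, 0)] + p_yt[(1, 0)])
    return p_y1_t1 - p_y1_t0


joint = full_joint()

# What a confounded dataset can reveal: the marginal P(Y, T), keyed by (y, t).
p_yt = {
    (y, t): joint[(y, t, 0)] + joint[(y, t, 1)]
    for y, t in product((0, 1), repeat=2)
}

# What a deconfounded dataset can additionally reveal: P(Z | Y, T), keyed by (z, y, t).
p_z_given_yt = {
    (z, y, t): joint[(y, t, z)] / p_yt[(y, t)]
    for y, t, z in product((0, 1), repeat=3)
}

# Recombine the two pieces into the full joint P(Y, T, Z).
recombined = {
    (y, t, z): p_yt[(y, t)] * p_z_given_yt[(z, y, t)]
    for y, t, z in product((0, 1), repeat=3)
}

true_ate = ate_from_joint(joint)
naive = naive_difference(p_yt)
recovered_ate = ate_from_joint(recombined)

print("true ATE:", true_ate)
print("naive contrast from P(Y, T) alone:", naive)
print("ATE from P(Y, T) combined with P(Z | Y, T):", recovered_ate)
```

In practice both factors would be estimated from finite samples. Because P(Y, T) can be estimated from the large confounded dataset, the small deconfounded dataset only needs to pin down P(Z | Y, T) in this sketch, and choosing which (Y, T) cells to deconfound is where selective sampling would enter.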