In many investigations, the primary outcome of interest is difficult or expensive to collect. Examples include long-term health effects of medical interventions, measurements requiring expensive testing or follow-up, and outcomes only measurable on small panels, as in marketing. This reduces the effective sample size for estimating the average treatment effect (ATE). However, there is often an abundance of observations on surrogate outcomes that are not of primary interest, such as short-term health effects or online-ad click-through. We study the role of such surrogate observations in the efficient estimation of treatment effects. To quantify their value, we derive the semiparametric efficiency bounds on ATE estimation with and without surrogates, as well as in several intermediary settings. The difference between these bounds characterizes the efficiency gains from optimally leveraging surrogates. We study two regimes: one in which the number of surrogate observations is comparable to the number of primary-outcome observations, and one in which the former dominates the latter. We take an agnostic missing-data approach that circumvents the strong surrogate conditions assumed in previous work. To realize the efficiency gains from surrogates, we develop efficient ATE estimation and inference based on flexible machine-learning estimates of the nuisance functions appearing in the influence functions we derive. We empirically demonstrate the gains by studying the long-term earnings effect of job training.