We consider a setting with $N$ heterogeneous units and $p$ interventions. Our goal is to learn unit-specific potential outcomes for any combination of these $p$ interventions, i.e., $N \times 2^p$ causal parameters. Choosing combinations of interventions is a problem that arises naturally in many applications, such as factorial design experiments, recommendation engines (e.g., showing a set of movies that maximizes engagement for users), combination therapies in medicine, and selecting important features for ML models. Running $N \times 2^p$ experiments to estimate these parameters is infeasible as $N$ and $p$ grow. Further, with observational data there is likely confounding, i.e., whether or not a unit is seen under a combination is correlated with its potential outcome under that combination. To address these challenges, we propose a novel model that imposes latent structure across both units and combinations. We assume latent similarity across units (i.e., the potential outcomes matrix is rank $r$) and regularity in how combinations interact (i.e., the coefficients in the Fourier expansion of the potential outcomes are $s$-sparse). We establish identification for all causal parameters despite unobserved confounding. We propose an estimation procedure, Synthetic Combinations, and establish finite-sample consistency under precise conditions on the observation pattern. Our results imply that Synthetic Combinations consistently estimates unit-specific potential outcomes given $\text{poly}(r) \times (N + s^2p)$ observations. In comparison, previous methods that do not exploit structure across both units and combinations have sample complexity scaling as $\min(N \times s^2p, \ r \times (N + 2^p))$. We use Synthetic Combinations to propose a data-efficient experimental design mechanism for combinatorial causal inference. We corroborate our theoretical findings with numerical simulations.
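To make the sparsity assumption concrete, the following is a minimal sketch of a Fourier expansion over combinations for a single unit. It is an illustration under stated assumptions, not the Synthetic Combinations estimator: the $\pm 1$ encoding of each intervention, the helper name `characters`, and the coefficient values are all hypothetical. Each of the $2^p$ combinations is a vector $x \in \{-1, 1\}^p$, each subset $S$ of interventions has a character $\chi_S(x) = \prod_{i \in S} x_i$, and the potential outcomes are a linear combination of characters with only $s$ nonzero coefficients.

```python
import itertools
import numpy as np


def characters(p):
    """Enumerate combinations, subsets, and the character matrix.

    Returns (combos, subsets, chi) where chi[x, S] is the product of the
    entries of combination x indexed by subset S (1 for the empty subset).
    """
    # All 2^p combinations, encoded as vectors in {-1, +1}^p (assumed encoding).
    combos = np.array(list(itertools.product([-1, 1], repeat=p)))
    # All 2^p subsets of the p interventions, ordered by size.
    subsets = [S for k in range(p + 1)
               for S in itertools.combinations(range(p), k)]
    chi = np.array([[np.prod(x[list(S)]) for S in subsets] for x in combos])
    return combos, subsets, chi


p = 3
combos, subsets, chi = characters(p)

# Hypothetical s = 3 sparse coefficient vector for one unit.
alpha = np.zeros(2 ** p)
alpha[subsets.index(())] = 1.0       # baseline outcome
alpha[subsets.index((0,))] = 0.5     # main effect of intervention 0
alpha[subsets.index((1, 2))] = -2.0  # interaction of interventions 1 and 2

# Potential outcomes for all 2^p combinations.
outcomes = chi @ alpha

# The characters are orthogonal, so coefficients are recovered by averaging.
recovered = chi.T @ outcomes / 2 ** p
num_nonzero = int(np.count_nonzero(np.round(recovered, 12)))
```

Here all $2^p$ outcomes are observed, so the coefficients follow directly from orthogonality of the characters. The point of the sparsity assumption in the abstract is that far fewer observed combinations suffice when only $s$ coefficients are nonzero.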