Consider the problem of determining the effect of a compound on a specific cell type. To answer this question, researchers traditionally need to run an experiment applying the drug of interest to that cell type. This approach is not scalable: given a large number of different actions (compounds) and a large number of different contexts (cell types), it is infeasible to run an experiment for every action-context pair. In such cases, one would ideally like to predict the outcome for every pair while only having to perform experiments on a small subset of pairs. This task, which we label "causal imputation", is a generalization of the causal transportability problem. To address this challenge, we extend the recently introduced synthetic interventions (SI) estimator to handle more general data sparsity patterns. We prove that, under a latent factor model, our estimator provides valid estimates for the causal imputation task. We motivate this model by establishing a connection to the linear structural causal model literature. Finally, we apply our estimator to the prominent CMAP dataset to predict the effects of compounds on gene expression across cell types. We find that our estimator outperforms standard baselines, thus confirming its utility in biological applications.
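To make the causal imputation task concrete, here is a minimal Python sketch of a synthetic-interventions-style prediction for a single context-action pair. It is an illustration under simplifying assumptions, not the estimator proposed in this work: outcomes are scalar, every donor context is required to share all training actions with the target context (the extension described above relaxes exactly this kind of restriction), and the weights come from a truncated least-squares fit. The function name `si_impute`, the `observed` mask, and the `rcond` threshold are hypothetical choices made for this example.

```python
import numpy as np


def si_impute(Y, observed, context, action, rcond=1e-8):
    """Estimate Y[context, action] from other observed entries (illustrative sketch).

    Y        : (n_contexts, n_actions) array of outcomes; unobserved entries are ignored.
    observed : boolean mask of the same shape, True where an experiment was run.
    rcond    : assumed cutoff for small singular values in the least-squares fit.
    """
    # Donor contexts: other contexts in which the target action was observed.
    donors = np.flatnonzero(observed[:, action])
    donors = donors[donors != context]

    # Training actions: observed in the target context and in every donor context.
    train = np.flatnonzero(observed[context] & observed[donors].all(axis=0))
    train = train[train != action]

    if donors.size == 0 or train.size == 0:
        raise ValueError("no usable donor contexts or training actions for this pair")

    # Express the target context as a linear combination of donor contexts,
    # fitted on the training actions. Singular values below rcond * (largest
    # singular value) are discarded, similar in spirit to principal component
    # regression.
    A = Y[np.ix_(donors, train)].T          # shape (n_train, n_donors)
    b = Y[context, train]                   # shape (n_train,)
    weights, *_ = np.linalg.lstsq(A, b, rcond=rcond)

    # Apply the same combination to the donors' outcomes under the target action.
    return float(weights @ Y[donors, action])


# Toy usage on synthetic data from a noiseless rank-2 latent factor model,
# Y[c, a] = <u_c, v_a>, with one held-out context-action pair.
rng = np.random.default_rng(0)
U = rng.normal(size=(6, 2))                 # latent factors for 6 contexts
V = rng.normal(size=(5, 2))                 # latent factors for 5 actions
Y = U @ V.T
observed = np.ones_like(Y, dtype=bool)
observed[0, 4] = False                      # pretend this experiment was never run

estimate = si_impute(Y, observed, context=0, action=4)
truth = Y[0, 4]
```

In this noiseless low-rank toy setting the fitted combination of donor contexts transfers to the held-out action, so `estimate` should agree with `truth` up to numerical error. With noisy data, the choice of `rcond` (or an explicit rank) would matter and would need to be tuned.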