Causal inference from observational data alone is a challenging problem. The task becomes easier when data from perturbations of the underlying system are available, even if the perturbations are not randomized: this is the setting we consider, and it also allows for latent confounding variables. To identify causal relations among a collection of covariates and a response variable, existing procedures rely on at least one of the following assumptions: i) the response variable remains unperturbed, ii) the latent variables remain unperturbed, and iii) the latent effects are dense. In this paper, we examine a perturbation model for interventional data over a collection of Gaussian variables that need not satisfy any of these conditions; the model can be viewed as a mixed-effects linear structural causal model. We propose a maximum-likelihood estimator, dubbed DirectLikelihood, that exploits system-wide invariances to uniquely identify the population causal structure from unspecific perturbation data, and our results carry over to linear structural causal models without requiring Gaussianity. We illustrate the utility of our framework on synthetic data as well as on real data involving California reservoirs and protein expressions.