Causal inference from observational data alone is widely understood to be a very challenging problem. Without additional strong assumptions, it is typically possible only with access to data arising from perturbations of the underlying system. To identify causal relations among a collection of covariates and a target or response variable, existing procedures rely on at least one of the following assumptions: i) the target variable remains unperturbed, ii) the hidden variables remain unperturbed, and iii) the hidden effects are dense. In this paper, we consider a perturbation model for interventional data (involving soft and hard interventions) over a collection of Gaussian variables that does not satisfy any of these conditions and can be viewed as a mixed-effects linear structural causal model. We propose a maximum-likelihood estimator, dubbed DirectLikelihood, that exploits system-wide invariances to uniquely identify the population causal structure from perturbation data. Our theoretical guarantees also carry over to settings where the variables are non-Gaussian but are generated according to a linear structural causal model. Further, we demonstrate that the population causal parameters are solutions to a worst-case risk with respect to distributional shifts from a certain perturbation class. We illustrate the utility of our perturbation model and the DirectLikelihood estimator on synthetic data as well as on real data involving protein expressions.
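To make the setting concrete, the following is a minimal sketch of a linear Gaussian structural causal model in which a perturbation acts on the covariates, the target, and the hidden variable at once. It is not the DirectLikelihood estimator and not the paper's exact parametrization: the form X = BX + ΓH + ε, the matrices `B` and `Gamma`, the noise scales, and all function names are illustrative assumptions. The sketch shows that the population regression coefficients of the target on the covariates change across environments while the causal matrix `B` stays fixed, which is the kind of system-wide invariance the abstract refers to.

```python
import numpy as np

# Hypothetical 3-variable system: X1 -> Y -> X2, one hidden variable H
# affecting all three. Index 2 is the target Y. All values are assumptions.
TARGET = 2
B = np.zeros((3, 3))
B[2, 0] = 1.0   # causal effect of X1 on Y
B[1, 2] = 0.5   # causal effect of Y on X2
Gamma = np.ones((3, 1))  # hidden-variable loadings

# Each environment: (per-variable noise std, hidden-variable std).
# The perturbed environment changes the target and the hidden variable too.
ENVIRONMENTS = {
    "observational": (np.ones(3), 1.0),
    "perturbed": (np.array([2.0, 1.5, 3.0]), 2.0),
}

def population_covariance(B, Gamma, noise_std, latent_std):
    """Covariance of X solving X = B X + Gamma H + eps."""
    p = B.shape[0]
    A = np.linalg.inv(np.eye(p) - B)
    inner = (latent_std ** 2) * Gamma @ Gamma.T + np.diag(noise_std ** 2)
    return A @ inner @ A.T

def simulate(B, Gamma, noise_std, latent_std, n, rng):
    """Draw n samples (rows) from the same model."""
    p, h = Gamma.shape
    A = np.linalg.inv(np.eye(p) - B)
    H = rng.normal(scale=latent_std, size=(n, h))
    eps = rng.normal(size=(n, p)) * noise_std
    return (H @ Gamma.T + eps) @ A.T

def regression_coefficients(Sigma, target):
    """Population least-squares coefficients of the target on the rest."""
    cov_idx = [j for j in range(Sigma.shape[0]) if j != target]
    return np.linalg.solve(Sigma[np.ix_(cov_idx, cov_idx)],
                           Sigma[cov_idx, target])

# Regression coefficients per environment; B is identical in both.
betas = {
    name: regression_coefficients(
        population_covariance(B, Gamma, noise_std, latent_std), TARGET)
    for name, (noise_std, latent_std) in ENVIRONMENTS.items()
}

rng = np.random.default_rng(0)
samples = simulate(B, Gamma, *ENVIRONMENTS["perturbed"], n=1000, rng=rng)
```

In this toy system the regression coefficients in `betas` differ between the two environments and match the causal row `B[2]` in neither, because the hidden variable confounds the covariates and the target. Plain regression is therefore not invariant here, whereas the structural parameters are.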