We consider the problem of learning the structure of a causal directed acyclic graph (DAG) model in the presence of latent variables. We define latent factor causal models (LFCMs) as a restricted class of causal DAG models with latent variables: an LFCM consists of clusters of observed variables that share the same latent parent, with connections between clusters given by edges pointing from observed variables to latent variables. LFCMs are motivated by gene regulatory networks, where regulatory edges, corresponding to transcription factors, connect spatially clustered genes. We establish identifiability results for this model and design a consistent three-stage algorithm that discovers, in turn, the clusters of observed nodes, a partial ordering over the clusters, and finally the entire structure over both observed and latent nodes. We evaluate our method in a synthetic setting, demonstrating that it recovers the ground truth clustering almost perfectly even at relatively low sample sizes, and that it recovers a significant number of the edges from observed variables to latent factors. Finally, we apply our method in a semi-synthetic setting to protein mass spectrometry data with a known ground truth network, and achieve almost perfect recovery of the ground truth variable clusters.
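To make the LFCM structure described in the abstract concrete, the following is a minimal sketch of a toy graph, based only on that description and not on the paper's formal definition or algorithm. The node names (`H1`..`H3`, `X1`..`X6`), the helper functions, and the rule that an observed variable may not point back to its own latent parent are all illustrative assumptions. The sketch groups observed variables by latent parent and derives one linear order of the clusters that is consistent with the partial ordering induced by the observed-to-latent edges.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical toy LFCM (names are illustrative, not from the paper).
# latent_parent[x] is the single latent factor that defines x's cluster.
latent_parent = {
    "X1": "H1", "X2": "H1",
    "X3": "H2", "X4": "H2",
    "X5": "H3", "X6": "H3",
}

# Cross-cluster edges point from an observed variable to a latent factor.
observed_to_latent = [("X1", "H2"), ("X2", "H3"), ("X4", "H3")]


def clusters(latent_parent):
    """Group observed variables by their shared latent parent."""
    grouped = {}
    for observed, latent in latent_parent.items():
        grouped.setdefault(latent, set()).add(observed)
    return grouped


def cluster_order(latent_parent, observed_to_latent):
    """Return one linear order of latent factors consistent with the edges.

    Raises ValueError if an observed variable points to its own latent
    parent, and graphlib.CycleError if the cluster-level graph is cyclic.
    """
    # Map each latent factor to the set of clusters that must precede it.
    predecessors = {latent: set() for latent in latent_parent.values()}
    for observed, target in observed_to_latent:
        source = latent_parent[observed]
        if source == target:
            raise ValueError(f"{observed} -> {target} would create a cycle")
        predecessors.setdefault(target, set()).add(source)
    return list(TopologicalSorter(predecessors).static_order())


if __name__ == "__main__":
    print(clusters(latent_parent))
    print(cluster_order(latent_parent, observed_to_latent))
```

In this toy graph the cluster-level order happens to be unique; in general the edges only determine a partial ordering, and `static_order()` returns just one linear extension of it.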