We consider the problem of causal structure learning in heterogeneous populations, i.e., populations in which a single causal structure does not adequately represent all members, as is common in the biological and social sciences. To this end, we introduce a distance covariance-based kernel designed specifically to measure the similarity between the underlying nonlinear causal structures of different samples. This kernel enables us to perform clustering to identify the homogeneous subpopulations. Indeed, we prove that the corresponding feature map is a statistically consistent estimator of nonlinear independence structure, so that the kernel itself serves as a statistical test for the hypothesis that sets of samples come from different generating causal structures. We can then use existing methods to learn a causal structure for each of these subpopulations. We demonstrate the use of our kernel for causal clustering with an application in genetics, which allows us to reason about the latent transcription factor networks regulating measured gene expression levels.
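To make the idea of a per-sample, distance covariance-based feature map more concrete, the following is a minimal sketch and not necessarily the construction used in the paper. It assumes one particular decomposition: the squared sample distance covariance between two variables is an average over samples, so each sample's term in that average can be read as its "contribution" to the pairwise dependence. The function names (`centered_distances`, `dcov_sq`, `sample_feature_map`) and the toy data are hypothetical and introduced only for illustration.

```python
import numpy as np

def centered_distances(x):
    """Double-centered pairwise distance matrix of a 1-D variable x of shape (n,)."""
    d = np.abs(x[:, None] - x[None, :])
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def dcov_sq(x, y):
    """Squared sample distance covariance (V-statistic form) between x and y."""
    return (centered_distances(x) * centered_distances(y)).mean()

def sample_feature_map(data):
    """Assumed per-sample feature map for data of shape (n, p).

    Row k holds sample k's contribution to dCov^2 for every variable pair
    (i, j) with i < j; averaging a column over samples recovers dcov_sq.
    """
    n, p = data.shape
    centered = [centered_distances(data[:, j]) for j in range(p)]
    pairs = [(i, j) for i in range(p) for j in range(i + 1, p)]
    feats = np.empty((n, len(pairs)))
    for c, (i, j) in enumerate(pairs):
        feats[:, c] = (centered[i] * centered[j]).mean(axis=1)
    return feats, pairs

# Toy data (illustrative only): variable 1 depends nonlinearly on variable 0,
# variable 2 is independent noise.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x, x**2 + 0.1 * rng.normal(size=200), rng.normal(size=200)])

feats, pairs = sample_feature_map(data)
K = feats @ feats.T  # linear kernel on the features: one similarity per pair of samples
```

Because `K` is an inner product of feature vectors, it is symmetric and positive semidefinite, so under these assumptions it could be passed to a kernel-based clustering method to group samples with similar dependence patterns.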