Causal discovery, the inference of causal relations from data, is a fundamental task across scientific domains, and several new machine learning methods for it have been proposed recently. However, existing machine learning methods for causal discovery typically require that the data used for inference be pooled and available in a centralized location. In many domains of high practical importance, such as healthcare, data is only available at local data-generating entities (e.g. hospitals) and cannot be shared across entities for privacy and regulatory reasons, among others. In this work, we address the problem of inferring causal structure, in the form of a directed acyclic graph (DAG), from a distributed data set that contains both observational and interventional data, in a privacy-preserving manner, by exchanging updates instead of samples. To this end, we introduce a new federated framework, FED-CD, that enables the discovery of global causal structures both when the set of intervened covariates is the same across decentralized entities and when the sets of intervened covariates are potentially disjoint. We perform a comprehensive experimental evaluation on synthetic data, which demonstrates that FED-CD enables effective aggregation of decentralized data for causal discovery without direct sample sharing, even when the contributing distributed data sets cover disjoint sets of interventions. Effective methods for causal discovery in distributed data sets could significantly advance scientific discovery and knowledge sharing in important settings, such as healthcare, in which sharing data across local sites is difficult or prohibited.
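To make "exchanging updates instead of samples" concrete, the following is a minimal sketch under stated assumptions, not the FED-CD aggregation rule itself, which the abstract does not specify. It assumes each client holds a hypothetical d x d matrix of edge beliefs (entry [i][j] is the local belief that variable i causes variable j) and that a server combines them by a weighted average. The function name `aggregate_edge_beliefs` and the choice of weights (for example, local sample counts) are illustrative assumptions.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def aggregate_edge_beliefs(
    local_beliefs: Sequence[Matrix],
    weights: Sequence[float],
) -> Matrix:
    """Weighted average of per-client edge-belief matrices.

    Assumption: only these matrices leave the clients; raw samples
    never do. This is a generic federated-averaging step, not the
    aggregation scheme of FED-CD.
    """
    if not local_beliefs or len(local_beliefs) != len(weights):
        raise ValueError("need one weight per client and at least one client")
    total = float(sum(weights))
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    d = len(local_beliefs[0])
    aggregated = [[0.0] * d for _ in range(d)]
    for beliefs, weight in zip(local_beliefs, weights):
        share = weight / total
        for i in range(d):
            for j in range(d):
                aggregated[i][j] += share * beliefs[i][j]

    # A DAG has no self-loops, so the diagonal is forced to zero.
    for i in range(d):
        aggregated[i][i] = 0.0
    return aggregated


# Two hypothetical clients with two variables each.
client_a = [[0.0, 0.8], [0.2, 0.0]]
client_b = [[0.0, 0.4], [0.6, 0.0]]
global_beliefs = aggregate_edge_beliefs([client_a, client_b], weights=[1, 3])
```

A per-edge weighting, in which each client contributes mainly to the edges its own interventions inform, would be a natural variant for the disjoint-intervention setting, but that too is an assumption rather than something stated in the abstract.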