VertiBayes: Learning Bayesian network parameters from vertically partitioned data with missing values

From arXiv. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

Federated learning makes it possible to train a machine learning model on decentralized data. Bayesian networks are probabilistic graphical models that have been widely used in artificial intelligence applications. Their popularity stems from the fact that they can be built by combining existing expert knowledge with data and are highly interpretable, which makes them useful for decision support, e.g. in healthcare. While some research has been published on the federated learning of Bayesian networks, publications on Bayesian networks in a vertically partitioned or heterogeneous data setting (where different variables are located in different datasets) are limited and suffer from important omissions, such as the handling of missing data. In this article, we propose a novel method called VertiBayes to train Bayesian networks (structure and parameters) on vertically partitioned data, which can handle missing values as well as an arbitrary number of parties. For structure learning, we adapted the widely used K2 algorithm with a privacy-preserving scalar product protocol. For parameter learning, we use a two-step approach: first, we learn an intermediate model using maximum likelihood by treating missing values as a special value; then, we train a model on synthetic data generated by the intermediate model using the EM algorithm. The privacy guarantees of our approach are equivalent to those provided by the privacy-preserving scalar product protocol used. We experimentally show that our approach produces models comparable to those learnt using traditional algorithms, and we estimate the increase in complexity in terms of samples, network size, and complexity. Finally, we propose two alternative approaches to estimate the performance of the model using vertically partitioned data, and we show in experiments that they lead to reasonably accurate estimates.
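The first parameter-learning step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the privacy-preserving scalar product protocol is replaced here by a plain dot product computed in the clear, the `MISSING` marker, the function names, and the toy columns are all hypothetical, and the second step (EM on synthetic data) is not shown. The point is only that a joint count over two parties' columns can be written as a scalar product of indicator vectors, and that a missing value can be counted like any other state.

```python
MISSING = "?"  # hypothetical marker: a missing value is treated as one more state


def indicator(column, value):
    """Binary vector with 1 where the record takes `value`, else 0."""
    return [1 if v == value else 0 for v in column]


def scalar_product(a, b):
    """Plain dot product; a stand-in for a privacy-preserving protocol."""
    return sum(x * y for x, y in zip(a, b))


def conditional_table(child, parent):
    """Maximum-likelihood estimate of P(child | parent) from aligned columns."""
    child_states = sorted(set(child))
    parent_states = sorted(set(parent))
    table = {}
    for p in parent_states:
        p_vec = indicator(parent, p)
        # Each joint count is one scalar product between the two parties' vectors.
        counts = {c: scalar_product(indicator(child, c), p_vec) for c in child_states}
        total = sum(counts.values())  # non-zero: p occurs at least once in `parent`
        table[p] = {c: counts[c] / total for c in child_states}
    return table


# Toy data: party A holds `smoker`, party B holds `cough`; rows are aligned by record.
smoker = ["yes", "yes", "no", "no", MISSING, "yes"]
cough = ["yes", MISSING, "no", "no", "yes", "yes"]

cpt = conditional_table(child=cough, parent=smoker)
```

In this sketch `cpt["yes"]` holds the estimated distribution of `cough` among records where `smoker` is "yes", including the share of records whose `cough` value is missing. In a real vertically partitioned setting neither party would see the other's indicator vector; only the resulting count would be revealed.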
