Bayesian networks are probabilistic graphical models that compactly represent dependencies among random variables. Handling missing data and hidden variables requires computing the marginal probability distribution of a subset of the variables. While knowledge of the marginal probability distribution is crucial for various problems in statistics and machine learning, its exact computation is generally infeasible for categorical variables because the task is NP-hard. We develop a divide-and-conquer approach that uses the graphical properties of Bayesian networks to split the computation of the marginal probability distribution into sub-calculations of lower dimensionality, reducing the overall computational complexity. Exploiting this decomposition, we present an efficient and scalable algorithm for estimating the marginal probability distribution for categorical variables. The novel method is compared against state-of-the-art approximate inference methods in a benchmarking study, where it displays superior performance. As an immediate application, we demonstrate how the marginal probability distribution can be used to classify incomplete data against Bayesian networks, and we apply this approach to identify the cancer subtype of kidney cancer patient samples.
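To make the idea of splitting a marginal into lower-dimensional sub-calculations concrete, the following is a minimal sketch on a toy network, not the algorithm proposed in this work. It assumes a hypothetical five-variable network O -> H1 -> X1 and O -> H2 -> X2, in which O, X1 and X2 are observed and H1, H2 are hidden; all names, the number of states `K`, and the randomly generated conditional probability tables are illustrative assumptions. Because the observed variable O separates the two hidden variables, the sum over (H1, H2) factorizes into two independent one-dimensional sums.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
K = 3  # number of categories per variable (assumed for illustration)


def random_cpt(n_parent_states, n_states):
    """Random conditional probability table: one row per parent state."""
    table = rng.random((n_parent_states, n_states))
    return table / table.sum(axis=1, keepdims=True)


# Hypothetical network: O -> H1 -> X1 and O -> H2 -> X2.
p_o = random_cpt(1, K)[0]    # P(O)
p_h1_o = random_cpt(K, K)    # P(H1 | O), indexed [o, h1]
p_h2_o = random_cpt(K, K)    # P(H2 | O), indexed [o, h2]
p_x1_h1 = random_cpt(K, K)   # P(X1 | H1), indexed [h1, x1]
p_x2_h2 = random_cpt(K, K)   # P(X2 | H2), indexed [h2, x2]


def marginal_brute_force(o, x1, x2):
    """Sum the joint over all K * K joint states of the hidden pair."""
    total = 0.0
    for h1, h2 in itertools.product(range(K), repeat=2):
        total += (p_o[o] * p_h1_o[o, h1] * p_h2_o[o, h2]
                  * p_x1_h1[h1, x1] * p_x2_h2[h2, x2])
    return total


def marginal_factorized(o, x1, x2):
    """Split the sum into two one-dimensional sums (K + K terms)."""
    sum_h1 = p_h1_o[o] @ p_x1_h1[:, x1]  # sum over h1 only
    sum_h2 = p_h2_o[o] @ p_x2_h2[:, x2]  # sum over h2 only
    return p_o[o] * sum_h1 * sum_h2
```

Both functions return the same value of P(O, X1, X2) for any observed configuration. The brute-force version visits every joint state of the hidden variables, while the factorized version handles each hidden variable separately, which is the kind of saving a divide-and-conquer scheme aims for when the graph structure permits such a split.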