Inference of the marginal probability distribution is defined as the calculation of the probability of a subset of the variables and is relevant for handling missing data and hidden variables. While inference of the marginal probability distribution is crucial for various problems in machine learning and statistics, its exact computation is generally not feasible for categorical variables in Bayesian networks due to the NP-hardness of this task. We develop a divide-and-conquer approach using the graphical properties of Bayesian networks to split the computation of the marginal probability distribution into sub-calculations of lower dimensionality, thus reducing the overall computational complexity. Exploiting this property, we present an efficient and scalable algorithm for calculating the marginal probability distribution for categorical variables. The novel method is compared against state-of-the-art approximate inference methods in a benchmarking study, where it displays superior performance. As an immediate application, we demonstrate how our method can be used to classify incomplete data against Bayesian networks and use this approach for identifying the cancer subtype of kidney cancer patient samples.
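To make the idea of splitting a marginal into lower-dimensional sub-calculations concrete, the following minimal sketch computes the same marginal in two ways on a toy network. It is not the algorithm proposed here. It assumes a hypothetical three-variable chain A -> B -> C with binary variables and made-up conditional probability tables, and uses plain variable elimination as a stand-in for the general principle. All names and numbers are illustrative assumptions.

```python
from itertools import product

# Hypothetical conditional probability tables for a chain A -> B -> C.
# Every variable is binary; the numbers are illustrative only.
p_a = [0.6, 0.4]                          # P(A)
p_b_given_a = [[0.7, 0.3], [0.2, 0.8]]    # P(B | A), indexed [a][b]
p_c_given_b = [[0.9, 0.1], [0.5, 0.5]]    # P(C | B), indexed [b][c]


def marginal_c_brute_force():
    """Sum the full joint P(A, B, C) over A and B.

    The loop visits every joint configuration, so its cost grows
    exponentially with the number of variables.
    """
    marginal = [0.0, 0.0]
    for a, b, c in product(range(2), repeat=3):
        marginal[c] += p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]
    return marginal


def marginal_c_factored():
    """Use the chain structure to sum out one variable at a time.

    Each step only touches a two-variable table, which is the kind of
    lower-dimensional sub-calculation the graph structure allows.
    """
    # Step 1: eliminate A to obtain P(B).
    p_b = [sum(p_a[a] * p_b_given_a[a][b] for a in range(2)) for b in range(2)]
    # Step 2: eliminate B to obtain P(C).
    return [sum(p_b[b] * p_c_given_b[b][c] for b in range(2)) for c in range(2)]


if __name__ == "__main__":
    print("brute force:", marginal_c_brute_force())
    print("factored:   ", marginal_c_factored())
```

Both functions return the same distribution over C. The difference is that the factored version never builds the full joint table, which is what makes this style of computation attractive as the number of variables grows.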