Greedy structure learning from data that contain systematic missing values

Missing values are common in data from many domains. Relatively few Bayesian Network (BN) structure learning algorithms account for missing data, and those that do tend to rely on standard approaches that assume data are missing at random, such as the Expectation-Maximisation algorithm. Because missing data are often systematic, there is a need for more pragmatic methods that can deal effectively with data sets containing values that are missing not at random. The absence of approaches that handle systematic missing data impedes the application of BN structure learning methods to real-world problems where missingness is not random. This paper describes three variants of greedy search structure learning that use pairwise deletion and inverse probability weighting to make maximal use of the observed data and to limit the potential bias caused by missing values. The first two variants can be viewed as sub-versions of the third and best-performing variant, but are important in their own right in illustrating the successive improvements in learning accuracy. The empirical investigations show that the proposed approach outperforms the commonly used and state-of-the-art Structural EM algorithm in both learning accuracy and efficiency, whether data are missing at random or not at random.
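
To make the two ingredients named above concrete, the following is a minimal sketch of pairwise deletion combined with inverse probability weighting on a toy data set. It is not the paper's algorithm: the function names (`pairwise_rows`, `ipw_weights`, `weighted_marginal`, `build_toy_data`), the toy data, and the choice to estimate the probability of being observed by stratifying on a single fully observed variable are all assumptions made for illustration.

```python
from collections import Counter


def pairwise_rows(data, variables):
    """Pairwise deletion: keep rows in which every variable in `variables`
    is observed, regardless of missing values in other columns."""
    return [row for row in data if all(row[v] is not None for v in variables)]


def ipw_weights(data, variables, covariate):
    """Attach to each retained row the weight 1 / P(row retained | covariate).

    Assumption: `covariate` is fully observed and the missingness of
    `variables` depends only on it. The probability is estimated as a
    simple frequency within each covariate value.
    """
    total = Counter(row[covariate] for row in data)
    kept = pairwise_rows(data, variables)
    observed = Counter(row[covariate] for row in kept)
    return [(row, total[row[covariate]] / observed[row[covariate]]) for row in kept]


def weighted_marginal(weighted_rows, variable, value):
    """Weighted estimate of P(variable = value)."""
    numerator = sum(w for row, w in weighted_rows if row[variable] == value)
    denominator = sum(w for _, w in weighted_rows)
    return numerator / denominator


def build_toy_data():
    """Hypothetical data: B is missing only when A = 1 (systematic missingness)."""
    data = []
    for i in range(40):  # A = 0: B always observed, C missing on even i
        c = None if i % 2 == 0 else 1
        data.append({"A": 0, "B": 1 if i < 10 else 0, "C": c})
    for i in range(40):  # A = 1: B missing on odd i, C always observed
        b = 1 if i < 30 else 0
        data.append({"A": 1, "B": None if i % 2 else b, "C": 0})
    return data


if __name__ == "__main__":
    data = build_toy_data()
    pair = pairwise_rows(data, ["A", "B"])
    complete = pairwise_rows(data, ["A", "B", "C"])
    print("rows kept for (A, B):", len(pair))
    print("rows kept with listwise deletion:", len(complete))

    unweighted = [(row, 1.0) for row in pair]
    print("unweighted P(B=1):", weighted_marginal(unweighted, "B", 1))
    weighted = ipw_weights(data, ["A", "B"], "A")
    print("IPW P(B=1):", weighted_marginal(weighted, "B", 1))
```

In this toy setting the full data have P(B=1) = 0.5. The unweighted estimate on the retained rows is 25/60, about 0.417, because rows with A = 1 are under-represented, while the weighted estimate recovers 0.5. Pairwise deletion keeps 60 rows for the pair (A, B), whereas requiring all three variables to be observed keeps only 40.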
