The graph of a Bayesian Network (BN) can be machine learned, determined by causal knowledge, or obtained through a combination of both. In disciplines like bioinformatics, applying BN structure learning algorithms can reveal new insights that would otherwise remain unknown. However, these algorithms are less effective when the input data are limited in sample size, which is often the case when working with real data. This paper focuses on purely machine learned and purely knowledge-based BNs and investigates how they differ in graphical structure and in how well the implied statistical models explain the data. The tests are based on four previous case studies whose BN structure was determined by domain knowledge. Using various metrics, we compare the knowledge-based graphs to the machine learned graphs generated by various algorithms implemented in TETRAD, spanning all three classes of learning. The results show that, while the algorithms produce graphs with much higher model selection scores, the knowledge-based graphs are more accurate predictors of the variables of interest. Maximising the fitting score is ineffective when sample size is limited, because the score becomes increasingly distorted as data become scarcer, guiding algorithms towards graphical patterns that attain higher fitting scores and yet deviate considerably from the true graph. This highlights the value of causal knowledge in these cases, as well as the need for fitting scores better suited to limited data. Lastly, the experiments provide new evidence supporting the notion that results from simulated data tell us little about actual real-world performance.