Successful machine learning methods require a trade-off between memorization and generalization. Too much memorization and the model cannot generalize to unobserved examples; too much generalization and we risk under-fitting the data. While we commonly measure performance through cross-validation and accuracy metrics, how should these algorithms cope in domains that are extremely under-determined, where accuracy is always unsatisfactory? We present a novel probabilistic graphical model structure learning approach that can learn, generalize, and explain in these elusive domains by operating at the random variable instantiation level. Using Minimum Description Length (MDL) analysis, we propose a new decomposition of the learning problem over all training exemplars, fusing together minimal entropy inferences to construct a final knowledge base. By leveraging Bayesian Knowledge Bases (BKBs), a framework that operates at the instantiation level and inherently subsumes Bayesian Networks (BNs), we develop both a theoretical MDL score and an associated structure learning algorithm that demonstrates significant improvements over learned BNs on 40 benchmark datasets. Further, our algorithm incorporates recent off-the-shelf DAG learning techniques, enabling tractable results even on large problems. We then demonstrate the utility of our approach in a significantly under-determined domain by learning gene regulatory networks from breast cancer gene mutational data available from The Cancer Genome Atlas (TCGA).
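Because the contribution rests on an MDL trade-off between model cost and data fit, the sketch below illustrates a generic two-part (BIC-style) MDL score for comparing candidate structures. It is only an illustration of the general principle, not the paper's instantiation-level BKB decomposition; the function name, arguments, and toy numbers are assumptions.

```python
# Hedged illustration: a generic two-part MDL score, where total description
# length = cost of encoding the data given the model + cost of encoding the
# model's parameters. Lower is better: it balances fit (memorization) against
# simplicity (generalization). Not the paper's exact BKB score.
import numpy as np

def mdl_score(log_likelihood: float, num_params: int, num_samples: int) -> float:
    """Return a two-part code length in bits for a candidate structure."""
    data_cost = -log_likelihood / np.log(2)                # nats -> bits
    model_cost = 0.5 * num_params * np.log2(num_samples)   # parameter encoding
    return data_cost + model_cost

# Toy usage: a denser structure fits slightly better but pays a larger model cost.
dense = mdl_score(log_likelihood=-1200.0, num_params=50, num_samples=500)
sparse = mdl_score(log_likelihood=-1250.0, num_params=10, num_samples=500)
print(f"dense: {dense:.1f} bits, sparse: {sparse:.1f} bits")
```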