Decision Trees are some of the most popular machine learning models today due to their out-of-the-box performance and interpretability. Often, Decision Trees models are constructed greedily in a top-down fashion via heuristic search criteria, such as Gini impurity or entropy. However, trees constructed in this manner are sensitive to minor fluctuations in training data and are prone to overfitting. In contrast, Bayesian approaches to tree construction formulate the selection process as a posterior inference problem; such approaches are more stable and provide greater theoretical guarantees. However, generating Bayesian Decision Trees usually requires sampling from complex, multimodal posterior distributions. Current Markov Chain Monte Carlo-based approaches for sampling Bayesian Decision Trees are prone to mode collapse and long mixing times, which makes them impractical. In this paper, we propose a new criterion for training Bayesian Decision Trees. Our criterion gives rise to BCART-PCFG, which can efficiently sample decision trees from a posterior distribution across trees given the data and find the maximum a posteriori (MAP) tree. Learning the posterior and training the sampler can be done in time that is polynomial in the dataset size. Once the posterior has been learned, trees can be sampled efficiently (linearly in the number of nodes). At the core of our method is a reduction of sampling the posterior to sampling a derivation from a probabilistic context-free grammar. We find that trees sampled via BCART-PCFG perform comparable to or better than greedily-constructed Decision Trees in classification accuracy on several datasets. Additionally, the trees sampled via BCART-PCFG are significantly smaller -- sometimes by as much as 20x.
翻译:决策树是当今最受欢迎的机器学习模型之一, 因为它们在框外的性能和可解释性。 通常, 决策树模型是通过超常搜索标准, 如 Gini 杂质或 entropy, 以自上而下的方式以自上而下的方式构建的。 然而, 以这种方式构建的树对培训数据的轻微波动十分敏感, 并且容易过度配置。 相反, 巴伊西亚树建设方法将选择过程描述成一个后继推论问题; 此类方法更稳定, 并提供更大的理论保障。 但是, 创建巴伊西亚决策树通常需要从复杂、 多式联运的海报分布中取样。 目前 Markov 链 Monte Car 用于取样 Bayesian 决定树的策略容易发生模式崩溃和长时间混杂。 然而, 我们的标定标准是 BCARRT- PC, 可以有效地从树的表面分布到树上的红外线( MAP) 树中采集最高级的 。 在可比较的取样方法中, 我们的骨质样本中, 将数据从一个直径流数据从一个直到一个开始, 。