Random Forests (RF) is a popular machine learning method for classification and regression problems. It applies bagging to decision tree models, and one of its primary advantages is the reduction in the variance of the forecast. In large-scale applications with millions of data points and hundreds of features, the fitted objects can grow very large and, depending on the number and depth of the trees, reach the limits of the available space in production setups. This is especially challenging when trained models must be downloaded on demand to small devices with limited memory. There is therefore a need to approximate trained RF models in a way that significantly reduces the model size without losing too much prediction accuracy. In this project we study methods that approximate each fitted tree in the Random Forests model using a multinomial allocation of the data points to the leaves. Specifically, we begin by studying whether fitting a multinomial logistic regression (and subsequently a generalized additive model (GAM) extension) to the output of each tree reduces the size while preserving the prediction quality.
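The core idea above can be sketched in a few lines: treat the leaf that each data point falls into as a multinomial outcome, fit a softmax (multinomial logistic) regression to predict leaf membership from the features, and replace the tree by that regression plus a small leaf-to-label lookup. This is only an illustrative sketch using scikit-learn on synthetic data; the model names, dataset, and the pickle-based size comparison are assumptions, not the project's actual implementation.

```python
# Illustrative sketch (not the project's implementation): approximate one
# fitted decision tree by a multinomial logistic regression over its leaves.
import pickle
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training data (an assumption for the sketch).
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

tree = DecisionTreeClassifier(max_depth=8, random_state=0).fit(X, y)

# Each training point lands in exactly one leaf; treat the leaf index as a
# multinomial outcome and fit a softmax regression to predict it from X.
leaf_ids = tree.apply(X)
surrogate = LogisticRegression(max_iter=1000).fit(X, leaf_ids)

# Map each leaf index to the class the tree predicts in that leaf.
leaf_to_class = {
    leaf: int(np.argmax(tree.tree_.value[leaf]))
    for leaf in np.unique(leaf_ids)
}

def surrogate_predict(X_new):
    """Predict leaf membership with the surrogate, then return leaf labels."""
    leaves = surrogate.predict(X_new)
    return np.array([leaf_to_class[leaf] for leaf in leaves])

# How often the surrogate reproduces the tree, and the serialized sizes.
agreement = float(np.mean(surrogate_predict(X) == tree.predict(X)))
size_tree = len(pickle.dumps(tree))
size_surrogate = len(pickle.dumps(surrogate))
```

In this formulation the surrogate's size depends on the number of leaves times the number of features (one coefficient row per leaf) rather than on the tree's node structure, which is what makes the size/accuracy trade-off worth measuring; the GAM extension mentioned above would replace the linear terms with smooth ones.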