ControlBurn: Nonlinear Feature Selection with Sparse Tree Ensembles

ControlBurn is a Python package to construct feature-sparse tree ensembles that support nonlinear feature selection and interpretable machine learning. The algorithms in this package first build large tree ensembles that prioritize basis functions with few features and then select a feature-sparse subset of these basis functions using a weighted lasso optimization criterion. The package includes visualizations to analyze the features selected by the ensemble and their impact on predictions. Hence, ControlBurn offers the accuracy and flexibility of tree-ensemble models and the interpretability of sparse generalized additive models. ControlBurn is scalable and flexible: for example, it can use warm-start continuation to compute the regularization path (prediction error for any number of selected features) for a dataset with tens of thousands of samples and hundreds of features in seconds. For larger datasets, the runtime scales linearly in the number of samples and features (up to a log factor), and the package supports acceleration using sketching. Moreover, the ControlBurn framework accommodates feature costs, feature groupings, and $\ell_0$-based regularizers. The package is user-friendly and open-source: its documentation and source code appear on https://pypi.org/project/ControlBurn/ and https://github.com/udellgroup/controlburn/.
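The two-stage idea described in the abstract (grow many small trees, then keep a feature-sparse subset of them with a weighted lasso) can be illustrated with a minimal sketch. This is not the ControlBurn API: it uses only scikit-learn, and the function name `sparse_tree_ensemble`, the bagging scheme, the per-tree weight (number of distinct features a tree splits on), and all parameter defaults are assumptions made for illustration. The weighted lasso penalty is implemented by rescaling each tree's prediction column by its weight and then solving a standard nonnegative lasso.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor


def sparse_tree_ensemble(X, y, n_trees=100, max_depth=2, alpha=1.0, seed=0):
    """Hypothetical helper: bag shallow trees, then prune them with a weighted lasso.

    Returns the sorted indices of features used by trees with nonzero weight,
    and the weight vector over trees.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    used_features, preds, costs = [], [], []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)  # bootstrap sample
        tree = DecisionTreeRegressor(
            max_depth=max_depth, random_state=int(rng.integers(2**31 - 1))
        )
        tree.fit(X[idx], y[idx])
        # Internal nodes store the split feature; leaves store a negative marker.
        feats = np.unique(tree.tree_.feature[tree.tree_.feature >= 0])
        if feats.size == 0:
            continue  # skip trees that never split
        used_features.append(feats)
        preds.append(tree.predict(X))
        costs.append(feats.size)  # assumed weight: number of features in the tree

    A = np.column_stack(preds)          # one column of predictions per tree
    u = np.asarray(costs, dtype=float)  # per-tree lasso weights

    # Penalizing sum_i u_i * w_i with w >= 0 is equivalent to a plain lasso
    # on the rescaled columns A_i / u_i, with coefficients v_i = u_i * w_i.
    lasso = Lasso(alpha=alpha, positive=True, max_iter=10000)
    lasso.fit(A / u, y)
    w = lasso.coef_ / u

    selected = sorted(
        {int(f) for feats, wi in zip(used_features, w) if wi > 0 for f in feats}
    )
    return selected, w


# Synthetic data: 20 features, only 3 of which carry signal.
X, y = make_regression(
    n_samples=300, n_features=20, n_informative=3, noise=1.0, random_state=0
)
selected, weights = sparse_tree_ensemble(X, y, alpha=1.0)
print("selected features:", selected)
print("trees kept:", int(np.sum(weights > 0)), "of", weights.size)
```

Raising `alpha` drives more tree weights to zero and so selects fewer features; sweeping `alpha` over a grid traces out a regularization path of the kind the abstract mentions.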


Related content

Feature selection


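Feature selection, also called feature subset selection (FSS) or attribute selection, means choosing N of the M available features so that a given system metric is optimized. It is the process of picking the most effective of the original features in order to reduce the dimensionality of the dataset. It is an important means of improving the performance of learning algorithms and a key data preprocessing step in pattern recognition. For a learning algorithm, good training samples are key to training a model.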
