Nowadays, feature selection is frequently used in machine learning when there is a risk of performance degradation due to overfitting or when computational resources are limited. During feature selection, the subset of features that is most relevant and least redundant is chosen. In recent years, it has become clear that, in addition to relevance and redundancy, feature complementarity must be considered. Informally, features are complementary if each is a weak predictor of the target variable on its own but they form a strong predictor when combined. It is demonstrated in this paper that the synergistic effect of complementary features mutually amplifying each other in the construction of two-tier decision trees can be disrupted by another, interfering feature, resulting in a decrease in performance. Using cross-validation on both synthetic and real datasets, for both regression and classification tasks, it is demonstrated that eliminating the interfering feature can improve performance by up to 24 times. It has also been found that the more poorly a dataset is learned, the greater the resulting performance gain: more formally, there is a statistically significant negative rank correlation between performance on the dataset prior to the elimination of the interfering feature and the performance growth achieved after its elimination. It is concluded that this broadens the scope of feature selection methods to cases where data and computational resources are sufficient.
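The following is a minimal sketch, not the paper's experimental setup, illustrating the two ideas in the abstract on a hypothetical XOR-style synthetic task: two complementary features that are useless individually but near-perfect together for a depth-2 ("two-tier") decision tree, and an interfering feature that, once added, captures the greedy root split and degrades performance until it is removed. The dataset, noise level, and scikit-learn model choice are assumptions made for illustration only.

```python
# Sketch of feature complementarity and interference (illustrative assumptions,
# not the paper's code): y is the XOR of x1 and x2, so each feature alone is a
# weak predictor, while a depth-2 tree using both is almost perfect. A noisy
# copy of y (x3) interferes by winning the greedy root split.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 2000
x1 = rng.integers(0, 2, n)                     # weak predictor on its own
x2 = rng.integers(0, 2, n)                     # weak predictor on its own
y = np.logical_xor(x1, x2).astype(int)         # target depends on the pair jointly
x3 = np.where(rng.random(n) < 0.6, y, 1 - y)   # interfering feature: 60% agreement with y

tree = DecisionTreeClassifier(max_depth=2, random_state=0)  # "two-tier" tree

# Each complementary feature alone: accuracy near chance (~0.5).
for name, col in [("x1 only", x1), ("x2 only", x2)]:
    score = cross_val_score(tree, col.reshape(-1, 1), y, cv=5).mean()
    print(f"{name:14s} accuracy ~ {score:.2f}")

# Both complementary features together: near-perfect accuracy (~1.0).
pair = np.column_stack([x1, x2])
print(f"x1 + x2        accuracy ~ {cross_val_score(tree, pair, y, cv=5).mean():.2f}")

# Adding the interfering feature: the greedy root split picks x3 (the only
# feature with individual gain), the two-tier budget is spent, and accuracy
# drops to roughly the 0.6 agreement level. Eliminating x3 restores it.
triple = np.column_stack([x1, x2, x3])
print(f"x1 + x2 + x3   accuracy ~ {cross_val_score(tree, triple, y, cv=5).mean():.2f}")
```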