Nowadays, feature selection is frequently used in machine learning when there is a risk of performance degradation due to overfitting or when computational resources are limited. During feature selection, the subset of features that is most relevant and least redundant is chosen. In recent years, it has become clear that, in addition to relevance and redundancy, feature complementarity must be considered. Informally, features are complementary if each is a weak predictor of the target variable on its own but they form a strong predictor when combined. It is demonstrated in this paper that the synergistic effect of complementary features mutually amplifying each other in the construction of two-tier decision trees can be disrupted by another, interfering feature, resulting in a decrease in performance. Using cross-validation on both synthetic and real datasets, for both regression and classification tasks, it is demonstrated that eliminating the interfering feature can improve performance by up to 24 times. It has also been found that the more poorly a dataset is learned, the greater the resulting performance gain: more formally, there is a statistically significant negative rank correlation between performance on the dataset prior to the elimination of the interfering feature and the performance growth achieved after its elimination. It is concluded that this broadens the scope of feature selection methods to cases where data and computational resources are sufficient.
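The following is a minimal sketch, not the paper's experimental setup, illustrating the two ideas in the abstract on a hypothetical XOR-style synthetic task: two complementary features that are useless individually but near-perfect together for a depth-2 ("two-tier") decision tree, and an interfering feature that, once added, captures the greedy root split and degrades performance until it is removed. The dataset, noise level, and scikit-learn model choice are assumptions made for illustration only.

```python
# Sketch of feature complementarity and interference (illustrative assumptions,
# not the paper's code): y is the XOR of x1 and x2, so each feature alone is a
# weak predictor, while a depth-2 tree using both is almost perfect. A noisy
# copy of y (x3) interferes by winning the greedy root split.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 2000
x1 = rng.integers(0, 2, n)                     # weak predictor on its own
x2 = rng.integers(0, 2, n)                     # weak predictor on its own
y = np.logical_xor(x1, x2).astype(int)         # target depends on the pair jointly
x3 = np.where(rng.random(n) < 0.6, y, 1 - y)   # interfering feature: 60% agreement with y

tree = DecisionTreeClassifier(max_depth=2, random_state=0)  # "two-tier" tree

# Each complementary feature alone: accuracy near chance (~0.5).
for name, col in [("x1 only", x1), ("x2 only", x2)]:
    score = cross_val_score(tree, col.reshape(-1, 1), y, cv=5).mean()
    print(f"{name:14s} accuracy ~ {score:.2f}")

# Both complementary features together: near-perfect accuracy (~1.0).
pair = np.column_stack([x1, x2])
print(f"x1 + x2        accuracy ~ {cross_val_score(tree, pair, y, cv=5).mean():.2f}")

# Adding the interfering feature: the greedy root split picks x3 (the only
# feature with individual gain), the two-tier budget is spent, and accuracy
# drops to roughly the 0.6 agreement level. Eliminating x3 restores it.
triple = np.column_stack([x1, x2, x3])
print(f"x1 + x2 + x3   accuracy ~ {cross_val_score(tree, triple, y, cv=5).mean():.2f}")
```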