Feature selection eliminates redundant features while retaining relevant ones, which can improve a machine learning algorithm's performance and reduce computation time. Among the various approaches, mutual information has attracted increasing attention as an effective criterion for measuring variable correlation. However, existing work mainly focuses on maximizing feature relevancy with the class label and minimizing redundancy within the selected features. We argue that minimizing feature redundancy is reasonable but not necessary, because some so-called redundant features also carry useful information that can improve performance. Regarding mutual information calculation, an improper neighborhood partition may distort the true relationship between two variables; traditional methods usually split continuous variables into several intervals and even ignore this influence. We theoretically prove how variable fluctuation negatively affects mutual information calculation. To remove these obstacles, we propose a full conditional mutual information maximization method (FCMIM) for feature selection, which considers feature relevancy in only two aspects. To obtain a better partition and eliminate the negative influence of attribute fluctuation, we present an adaptive neighborhood partition algorithm (ANP) driven by feedback from the mutual information maximization algorithm; this backpropagation process helps search for a proper neighborhood partition parameter. We compare our method with several mutual information methods on 17 benchmark datasets. FCMIM outperforms the other methods under different classifiers, and the results show that ANP improves the performance of nearly all the mutual information methods.
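To see why the neighborhood partition matters for mutual information estimation, the following sketch (an illustration of the traditional equal-width discretization the abstract critiques, not the authors' ANP algorithm; the function name and bin counts are our own choices) estimates I(X;Y) between a continuous feature and a discrete label under several interval partitions:

```python
import numpy as np

def mutual_information(x, y, n_bins):
    """Estimate I(X;Y) by splitting the continuous feature x into
    n_bins equal-width intervals; y is assumed to be a discrete label.
    This is the 'traditional' partition scheme, whose choice of n_bins
    can distort the estimated relationship between the variables."""
    edges = np.linspace(x.min(), x.max(), n_bins + 1)[1:-1]  # interior bin edges
    x_binned = np.digitize(x, edges)                          # indices 0..n_bins-1
    labels = np.unique(y)
    # Joint distribution over (feature bin, label)
    joint = np.zeros((n_bins, len(labels)))
    for xi, yi in zip(x_binned, np.searchsorted(labels, y)):
        joint[xi, yi] += 1
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)  # marginal of binned X
    py = joint.sum(axis=0, keepdims=True)  # marginal of Y
    nz = joint > 0                          # avoid log(0) on empty cells
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 500)
x = y + rng.normal(0.0, 0.5, 500)  # feature correlated with the label

# The same data yields different MI estimates under different partitions:
for bins in (2, 5, 20):
    print(bins, round(mutual_information(x, y, bins), 3))
```

Running this shows the estimate shifting with the number of intervals, which is the sensitivity that motivates searching for a proper partition parameter rather than fixing it a priori.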