Feature selection (FS) is an important research topic in machine learning. FS is usually modelled as a bi-objective optimization problem whose objectives are (1) classification accuracy and (2) the number of features. One of the main issues in real-world applications is missing data. Databases with missing data are likely to be unreliable, so FS performed on a data set with missing values is also unreliable. To control this issue directly, we propose in this study a novel modelling of FS: we include reliability as the third objective of the problem. To address the modified problem, we propose applying the non-dominated sorting genetic algorithm-III (NSGA-III). We selected six incomplete data sets from the University of California Irvine (UCI) machine learning repository and used mean imputation to deal with the missing data. In the experiments, k-nearest neighbors (k-NN) is used as the classifier to evaluate the feature subsets. Experimental results show that the proposed three-objective model coupled with NSGA-III efficiently addresses the FS problem for the six data sets included in this study.
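As a rough illustration of the three-objective formulation described above, the following sketch scores one candidate feature subset after mean imputation with a k-NN classifier. It is a minimal sketch under stated assumptions, not the study's implementation: the abstract does not define the reliability measure, so `reliability` here is assumed to be the fraction of originally observed (non-missing) cells among the selected features, and all function names, the leave-one-out protocol, and the toy data are hypothetical. The NSGA-III search itself is not shown; the returned tuple is only the kind of objective vector such an optimizer would minimize.

```python
import math
from collections import Counter


def mean_impute(rows):
    """Replace None entries with the mean of the observed values in each column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]


def knn_loo_accuracy(rows, labels, subset, k=3):
    """Leave-one-out accuracy of k-NN using only the columns in `subset`."""
    correct = 0
    for i, query in enumerate(rows):
        dists = []
        for j, other in enumerate(rows):
            if j == i:
                continue  # hold out the query sample itself
            d = math.sqrt(sum((query[c] - other[c]) ** 2 for c in subset))
            dists.append((d, labels[j]))
        dists.sort(key=lambda t: t[0])
        votes = Counter(label for _, label in dists[:k])
        if votes.most_common(1)[0][0] == labels[i]:
            correct += 1
    return correct / len(rows)


def reliability(raw_rows, subset):
    """ASSUMED measure: share of non-missing cells in the selected columns."""
    cells = [r[c] for r in raw_rows for c in subset]
    return sum(v is not None for v in cells) / len(cells)


def evaluate(raw_rows, labels, subset, k=3):
    """Return (error rate, feature count, unreliability), all to be minimized."""
    if not subset:
        raise ValueError("subset must contain at least one feature index")
    imputed = mean_impute(raw_rows)
    accuracy = knn_loo_accuracy(imputed, labels, subset, k)
    return (1.0 - accuracy, len(subset), 1.0 - reliability(raw_rows, subset))


if __name__ == "__main__":
    # Hypothetical toy data: None marks a missing value.
    raw = [
        [1.0, 5.0, None],
        [1.2, None, 0.3],
        [0.9, 4.0, None],
        [3.0, 5.5, 0.1],
        [3.2, None, None],
        [2.9, 4.5, 0.2],
    ]
    labels = [0, 0, 0, 1, 1, 1]
    for subset in ([0], [0, 2], [0, 1, 2]):
        print(subset, evaluate(raw, labels, subset, k=1))
```

Expressing all three objectives as quantities to minimize is a convenience for the sketch; an optimizer that maximizes accuracy and reliability directly would be equivalent.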