We develop a sparse optimization problem to determine the complete set of features that discriminate two or more classes. This is a sparse implementation of the centroid-encoder for nonlinear data reduction and visualization, called the Sparse Centroid-Encoder (SCE). We also provide a feature selection framework that first ranks each feature by its occurrence, after which the optimal number of features is chosen using a validation set. The algorithm is applied to a wide variety of data sets, including single-cell biological data, high-dimensional infectious disease data, hyperspectral data, image data, and speech data. We compared our method to various state-of-the-art feature selection techniques, including two neural network-based models (DFS and LassoNet), Sparse SVM, and Random Forest. We empirically showed that SCE features produced better classification accuracy on unseen test data, often with fewer features.
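As a rough illustration of the feature selection framework summarized above (rank features by how often they are selected across sparse runs, then pick the cutoff that maximizes validation accuracy), the following Python sketch is not the authors' implementation: the hypothetical sparse_support helper merely stands in for collecting the nonzero-support indices from trained Sparse Centroid-Encoder models, and the classifier and search range are arbitrary choices for the sake of the example.

```python
# Minimal sketch of occurrence-based feature ranking with a validation-set
# cutoff. The sparse runs are mocked with random support sets; a real
# pipeline would gather them from trained Sparse Centroid-Encoder models.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=600, n_features=100, n_informative=15,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  random_state=0)

def sparse_support(run_seed, n_features=100, k=20):
    # Placeholder: indices of features with nonzero weight in one sparse run.
    return np.random.default_rng(run_seed).choice(n_features, size=k,
                                                  replace=False)

# Rank features by how often they occur across repeated sparse runs.
n_runs = 25
counts = np.zeros(X.shape[1])
for seed in range(n_runs):
    counts[sparse_support(seed)] += 1
ranking = np.argsort(-counts)          # most frequently selected first

# Choose the number of top-ranked features by validation accuracy.
best_k, best_acc = None, -np.inf
for k in range(5, 55, 5):
    feats = ranking[:k]
    clf = KNeighborsClassifier().fit(X_train[:, feats], y_train)
    acc = clf.score(X_val[:, feats], y_val)
    if acc > best_acc:
        best_k, best_acc = k, acc
print(f"selected {best_k} features, validation accuracy {best_acc:.3f}")
```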