Effective feature selection is essential for high-dimensional data analysis and machine learning. Unsupervised feature selection (UFS) aims to cluster data and identify the most discriminative features at the same time. Most existing UFS methods linearly project features into a pseudo-label space for clustering, but they suffer from two critical limitations: (1) the linear mapping is too simple to capture complex feature relationships, and (2) they assume uniformly distributed clusters and ignore the outliers prevalent in real-world data. To address these issues, we propose the Robust Autoencoder-based Unsupervised Feature Selection (RAEUFS) model, which uses a deep autoencoder to learn nonlinear feature representations while improving robustness to outliers. We further develop an efficient optimization algorithm for RAEUFS. Extensive experiments demonstrate that our method outperforms state-of-the-art UFS approaches on both clean and outlier-contaminated data.
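For readers unfamiliar with the linear projection into a pseudo-label space mentioned above, the following is a minimal sketch of that conventional baseline, not of RAEUFS itself. It assumes the pseudo-labels `Y` are already available (here they are synthetic; in practice they would come from a clustering step), and it swaps in a plain ridge penalty for the sparsity-inducing regularizers that UFS methods typically use. The function name `linear_ufs_scores` and the parameter `lam` are hypothetical and chosen for illustration only.

```python
import numpy as np

def linear_ufs_scores(X, Y, lam=1.0):
    """Score each feature by the row norm of a ridge projection W with X @ W ~ Y.

    X: (n_samples, n_features) data matrix.
    Y: (n_samples, n_clusters) pseudo-label matrix (assumed given).
    lam: ridge strength (illustrative stand-in for a sparsity penalty).
    """
    X = X - X.mean(axis=0)          # center features
    Y = Y - Y.mean(axis=0)          # center pseudo-labels
    d = X.shape[1]
    # Closed-form ridge solution: W = (X^T X + lam I)^{-1} X^T Y
    W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
    # A feature whose row of W has a large norm contributes strongly
    # to predicting the pseudo-labels.
    return np.linalg.norm(W, axis=1)

# Synthetic data: two clusters, only feature 0 separates them.
rng = np.random.default_rng(0)
n = 200
labels = rng.integers(0, 2, size=n)
Y = np.eye(2)[labels]               # one-hot pseudo-labels (hypothetical)
X = rng.normal(size=(n, 5))
X[:, 0] += 4.0 * labels             # inject cluster structure into feature 0

scores = linear_ufs_scores(X, Y)
top = int(np.argmax(scores))        # index of the highest-scoring feature
```

Because the mapping from features to pseudo-labels is a single matrix `W`, this baseline can only express linear relationships, and the squared loss lets a few extreme samples pull `W` strongly; these are the two limitations the abstract attributes to existing methods.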