Modern datasets often contain large subsets of correlated features and nuisance features, which are unrelated or only loosely related to the main underlying structures of the data. Nuisance features can be identified using the Laplacian score criterion, which evaluates the importance of a given feature via its consistency with the leading eigenvectors of the graph Laplacian. We demonstrate that in the presence of large numbers of nuisance features, the Laplacian must be computed on the subset of selected features rather than on the complete feature set. To do this, we propose a fully differentiable approach for unsupervised feature selection that uses the Laplacian score criterion to avoid selecting nuisance features. To cope with correlated features, we employ an autoencoder architecture trained to reconstruct the data from the subset of selected features. We build on the recently proposed concrete layer, which allows the number of selected features to be controlled via architectural design, simplifying the optimization process. In experiments on several real-world datasets, we demonstrate that our proposed approach outperforms similar approaches designed to avoid only correlated features or only nuisance features, but not both. We also report several state-of-the-art clustering results.
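As a rough illustration of the Laplacian score criterion mentioned above, the following is a minimal NumPy sketch, not the method proposed in this work. It uses the standard score f^T L f / f^T D f on a degree-centered feature f, where lower values indicate a feature that varies smoothly over the sample graph. The dense Gaussian affinity, the median bandwidth heuristic, the function name `laplacian_scores`, and the `graph_features` parameter are all assumptions made for this example. The `graph_features` argument lets the graph be built from a subset of columns, which mirrors the point that the Laplacian should be computed on selected features rather than on the full feature set.

```python
import numpy as np


def laplacian_scores(X, graph_features=None, bandwidth=None):
    """Laplacian score of every column of X (lower = smoother on the graph).

    graph_features: column indices used to build the affinity graph
        (hypothetical parameter; None means all columns are used).
    bandwidth: Gaussian kernel scale; defaults to the median squared
        distance, which is a heuristic assumed here for illustration.
    """
    X = np.asarray(X, dtype=float)
    Z = X if graph_features is None else X[:, graph_features]

    # Pairwise squared Euclidean distances between samples.
    sq = np.sum(Z ** 2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (Z @ Z.T), 0.0)

    if bandwidth is None:
        bandwidth = np.median(d2[d2 > 0])

    # Dense Gaussian affinity matrix with no self-loops.
    W = np.exp(-d2 / bandwidth)
    np.fill_diagonal(W, 0.0)

    deg = W.sum(axis=1)          # node degrees (diagonal of D)
    L = np.diag(deg) - W         # unnormalized graph Laplacian

    # Remove the degree-weighted mean of each feature.
    Xc = X - (deg @ X) / deg.sum()

    num = np.sum(Xc * (L @ Xc), axis=0)   # f^T L f for each feature
    den = deg @ (Xc ** 2)                 # f^T D f for each feature
    return num / den


# Toy data: column 0 separates two clusters, the other 50 columns are noise.
rng = np.random.default_rng(0)
informative = np.repeat([-5.0, 5.0], 100) + 0.1 * rng.standard_normal(200)
X_demo = np.column_stack([informative, rng.standard_normal((200, 50))])

scores_full_graph = laplacian_scores(X_demo)                      # graph from all columns
scores_subset_graph = laplacian_scores(X_demo, graph_features=[0])  # graph from column 0 only
```

Comparing `scores_full_graph[0]` with `scores_subset_graph[0]` on such toy data gives a feel for how nuisance columns can blur the graph and, with it, the score of an informative feature. A practical implementation would more likely use a sparse k-nearest-neighbor graph than the dense kernel assumed here.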