Feature selection in clustering is a hard task that involves simultaneously discovering relevant clusters and the variables relevant to these clusters. While feature selection algorithms are often model-based, relying on optimised model selection or strong assumptions on $p(\pmb{x})$, we introduce a discriminative clustering model that maximises a geometry-aware generalisation of the mutual information, called GEMINI, under a simple $\ell_1$ penalty: the Sparse GEMINI. This algorithm avoids the burden of combinatorial feature subset exploration and scales easily to high-dimensional data and large sample sizes while requiring only the design of a clustering model $p_\theta(y|\pmb{x})$. We demonstrate the performance of Sparse GEMINI on synthetic as well as large-scale datasets. Our results show that Sparse GEMINI is a competitive algorithm that can select subsets of variables relevant to the clustering without relying on relevance criteria or prior hypotheses.
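To fix ideas, the penalised objective can be sketched as follows; the notation $W$ for the first-layer weights of the clustering network and $\widehat{\mathcal{I}}_{\mathrm{GEMINI}}$ for an empirical estimate of GEMINI are illustrative conventions here, the exact estimator and architecture being those defined in the paper:
\[
\max_{\theta}\;\; \widehat{\mathcal{I}}_{\mathrm{GEMINI}}\big(p_\theta(y \mid \pmb{x})\big) \;-\; \lambda\,\|W\|_1 .
\]
As $\lambda$ grows, the weights attached to uninformative variables shrink towards zero, so variable selection emerges from the optimisation itself rather than from a combinatorial search over feature subsets.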