We develop new algorithmic methods with provable guarantees for feature selection in categorical data clustering. While feature selection is one of the most common approaches to reduce dimensionality in practice, most of the known feature selection methods are heuristics. We study the following mathematical model. We assume that there are some inadvertent (or undesirable) features of the input data that unnecessarily increase the cost of clustering. Consequently, we want to select a subset of the original features from the data such that there is a small-cost clustering on the selected features. More precisely, for given integers $\ell$ (the number of irrelevant features) and $k$ (the number of clusters), budget $B$, and a set of $n$ categorical data points (represented by $m$-dimensional vectors whose elements belong to a finite set of values $\Sigma$), we want to select $m-\ell$ relevant features such that the cost of any optimal $k$-clustering on these features does not exceed $B$. Here the cost of a cluster is the sum of Hamming distances ($\ell_0$-distances) between the selected features of the elements of the cluster and its center. The clustering cost is the total sum of the costs of the clusters. We use the framework of parameterized complexity to identify how the complexity of the problem depends on parameters $k$, $B$, and $|\Sigma|$. Our main result is an algorithm that solves the Feature Selection problem in time $f(k,B,|\Sigma|)\cdot m^{g(k,|\Sigma|)}\cdot n^2$ for some functions $f$ and $g$. In other words, the problem is fixed-parameter tractable parameterized by $B$ when $|\Sigma|$ and $k$ are constants. Our algorithm is based on a solution to a more general problem, Constrained Clustering with Outliers. We also complement our algorithmic findings with complexity lower bounds.
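To make the cost model concrete, the following minimal sketch computes the clustering cost defined above: for a fixed set of selected feature indices, each cluster's cost is the sum of Hamming ($\ell_0$) distances from its points to the cluster center, and the total cost sums over clusters. The coordinate-wise majority value is an optimal center under Hamming distance, so the sketch uses it; the function names and data layout are illustrative, not from the paper.

```python
from collections import Counter

def cluster_cost(points, selected):
    """Cost of one cluster restricted to the selected feature indices:
    the sum of Hamming distances from each point to the coordinate-wise
    majority center, which minimizes the l0 cost for a single cluster."""
    center = [Counter(p[j] for p in points).most_common(1)[0][0]
              for j in selected]
    return sum(sum(p[j] != c for j, c in zip(selected, center))
               for p in points)

def clustering_cost(clusters, selected):
    """Total cost of a k-clustering: sum of per-cluster costs."""
    return sum(cluster_cost(cluster, selected) for cluster in clusters)
```

For example, for the cluster `[('a','b','c'), ('a','b','d'), ('a','x','c')]` with all three features selected, the majority center is `('a','b','c')` and the cost is 2; dropping the second (irrelevant) feature, i.e. selecting indices `[0, 2]`, lowers the cost to 1, which is exactly the effect feature selection exploits.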