We develop new algorithmic methods with provable guarantees for feature selection in categorical data clustering. While feature selection is one of the most common approaches to reduce dimensionality in practice, most of the known feature selection methods are heuristics. We study the following mathematical model. We assume that there are some inadvertent (or undesirable) features of the input data that unnecessarily increase the cost of clustering. Consequently, we want to select a subset of the original features from the data such that there is a small-cost clustering on the selected features. More precisely, for given integers $\ell$ (the number of irrelevant features) and $k$ (the number of clusters), budget $B$, and a set of $n$ categorical data points (represented by $m$-dimensional vectors whose elements belong to a finite set of values $\Sigma$), we want to select $m-\ell$ relevant features such that the cost of any optimal $k$-clustering on these features does not exceed $B$. Here the cost of a cluster is the sum of Hamming distances ($\ell_0$-distances) between the selected features of the elements of the cluster and its center. The clustering cost is the total sum of the costs of the clusters. We use the framework of parameterized complexity to identify how the complexity of the problem depends on parameters $k$, $B$, and $|\Sigma|$. Our main result is an algorithm that solves the Feature Selection problem in time $f(k,B,|\Sigma|)\cdot m^{g(k,|\Sigma|)}\cdot n^2$ for some functions $f$ and $g$. In other words, the problem is fixed-parameter tractable parameterized by $B$ when $|\Sigma|$ and $k$ are constants. Our algorithm is based on a solution to a more general problem, Constrained Clustering with Outliers. We also complement our algorithmic findings with complexity lower bounds.
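To make the cost model concrete, the following minimal sketch computes the clustering cost defined above: for a fixed set of selected feature indices, each cluster's cost is the sum of Hamming ($\ell_0$) distances from its points to the cluster center, and the total cost sums over clusters. The coordinate-wise majority value is an optimal center under Hamming distance, so the sketch uses it; the function names and data layout are illustrative, not from the paper.

```python
from collections import Counter

def cluster_cost(points, selected):
    """Cost of one cluster restricted to the selected feature indices:
    the sum of Hamming distances from each point to the coordinate-wise
    majority center, which minimizes the l0 cost for a single cluster."""
    center = [Counter(p[j] for p in points).most_common(1)[0][0]
              for j in selected]
    return sum(sum(p[j] != c for j, c in zip(selected, center))
               for p in points)

def clustering_cost(clusters, selected):
    """Total cost of a k-clustering: sum of per-cluster costs."""
    return sum(cluster_cost(cluster, selected) for cluster in clusters)
```

For example, for the cluster `[('a','b','c'), ('a','b','d'), ('a','x','c')]` with all three features selected, the majority center is `('a','b','c')` and the cost is 2; dropping the second (irrelevant) feature, i.e. selecting indices `[0, 2]`, lowers the cost to 1, which is exactly the effect feature selection exploits.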