Fuzzy Clustering with Similarity Queries

The fuzzy or soft $k$-means objective is a popular generalization of the well-known $k$-means problem, extending the clustering capability of $k$-means to datasets that are uncertain, vague, or otherwise hard to cluster. In this paper, we propose a semi-supervised active clustering framework in which the learner is allowed to interact with an oracle (a domain expert), asking for the similarity between a chosen set of items. We study the query and computational complexities of clustering in this framework. We prove that a few such similarity queries suffice to obtain a polynomial-time approximation algorithm for an otherwise conjecturally NP-hard problem. In particular, we provide probabilistic algorithms for fuzzy clustering in this setting that ask $O(\mathsf{poly}(k)\log n)$ similarity queries and run in polynomial time, where $n$ is the number of items. The fuzzy $k$-means objective is nonconvex, with $k$-means as a special case, and is equivalent to other generic nonconvex problems such as non-negative matrix factorization. The ubiquitous Lloyd-type algorithms (or expectation-maximization algorithms) can get stuck at a local minimum. Our results show that by making a few similarity queries, the problem becomes easier to solve. Finally, we test our algorithms on real-world datasets, showing their effectiveness in real-world applications.
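For readers unfamiliar with the objective, the following is a minimal sketch of the Lloyd-type alternating heuristic mentioned in the abstract. It is not the query-based algorithm proposed in the paper. It assumes the standard fuzzy $k$-means formulation, $J = \sum_i \sum_j u_{ij}^{\alpha} \lVert x_i - \mu_j \rVert^2$ with $\sum_j u_{ij} = 1$ and fuzzifier $\alpha > 1$, which the abstract does not spell out. All names here (`fuzzy_kmeans`, `memberships`, `update_centers`, `alpha`, `eps`) are illustrative choices, not taken from the paper.

```python
import random


def sq_dist(x, c):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, c))


def memberships(points, centers, alpha=2.0, eps=1e-12):
    """For fixed centers, set u_ij proportional to d_ij^(-1/(alpha-1)),
    where d_ij is the squared distance; eps guards against division by zero."""
    U = []
    for x in points:
        w = [(sq_dist(x, c) + eps) ** (-1.0 / (alpha - 1.0)) for c in centers]
        s = sum(w)
        U.append([v / s for v in w])
    return U


def update_centers(points, U, alpha=2.0):
    """For fixed memberships, each center is the u_ij^alpha-weighted mean."""
    k, dim = len(U[0]), len(points[0])
    centers = []
    for j in range(k):
        weights = [row[j] ** alpha for row in U]
        total = sum(weights)
        centers.append(
            [sum(w * x[d] for w, x in zip(weights, points)) / total
             for d in range(dim)]
        )
    return centers


def objective(points, centers, U, alpha=2.0):
    """Fuzzy k-means cost J under the assumed standard formulation."""
    return sum(
        U[i][j] ** alpha * sq_dist(x, c)
        for i, x in enumerate(points)
        for j, c in enumerate(centers)
    )


def fuzzy_kmeans(points, k, alpha=2.0, iters=50, seed=0):
    """Alternate the two updates; this can stall at a local minimum."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    history = []
    for _ in range(iters):
        U = memberships(points, centers, alpha)
        history.append(objective(points, centers, U, alpha))
        centers = update_centers(points, U, alpha)
    U = memberships(points, centers, alpha)
    history.append(objective(points, centers, U, alpha))
    return centers, U, history
```

Each of the two updates is optimal with the other block of variables held fixed, so the cost never increases, but the joint problem is nonconvex and the result depends on the random initialization. This is the failure mode that the similarity queries are meant to help avoid.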
