We study the classical metric $k$-median clustering problem over a set of input rankings (i.e., permutations), which has myriad applications, from social-choice theory to web search and databases. A folklore algorithm provides a $2$-approximate solution in polynomial time for all $k=O(1)$, and works irrespective of the underlying distance measure, so long as it is a metric; however, going below the factor of $2$ is a notorious challenge. We consider the Ulam distance, a variant of the well-known edit-distance metric in which strings are restricted to be permutations. For this metric, Chakraborty, Das, and Krauthgamer [SODA, 2021] provided a $(2-\delta)$-approximation algorithm for $k=1$, where $\delta\approx 2^{-40}$. Our primary contribution is a new algorithmic framework for clustering a set of permutations. Our first result is a $1.999$-approximation algorithm for the metric $k$-median problem under the Ulam metric, which runs in time $(k \log (nd))^{O(k)}n d^3$ for an input consisting of $n$ permutations over $[d]$. In fact, our framework is powerful enough to extend this result to the streaming model (where the $n$ input permutations arrive one by one) using only polylogarithmic (in $n$) space. Additionally, we show that similar results can be obtained even in the presence of outliers, which is presumably a more difficult problem.
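To make the baseline concrete, the following is a minimal Python sketch of the folklore $2$-approximation mentioned above, not of the framework proposed in this work. It assumes the move-based definition of the Ulam distance, under which the distance between two permutations of length $d$ equals $d$ minus the length of their longest common subsequence. It also assumes that the folklore algorithm restricts candidate centers to the input permutations and tries every $k$-subset. The function names `ulam_distance` and `best_input_centers` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left
from itertools import combinations


def ulam_distance(x, y):
    """Ulam distance, assumed here to be d - LCS(x, y).

    x and y must be permutations of the same set of symbols.
    """
    # Relabel y by the position of each symbol in x; the LCS of x and y
    # then equals the longest increasing subsequence of the relabeled y.
    pos = {sym: i for i, sym in enumerate(x)}
    seq = [pos[sym] for sym in y]

    # Patience-sorting LIS in O(d log d): tails[i] is the smallest possible
    # last element of an increasing subsequence of length i + 1.
    tails = []
    for v in seq:
        i = bisect_left(tails, v)
        if i == len(tails):
            tails.append(v)
        else:
            tails[i] = v
    return len(x) - len(tails)


def best_input_centers(perms, k):
    """Try every k-subset of the inputs as centers (assumed baseline).

    Returns (centers, cost), where cost is the sum over all inputs of the
    distance to the nearest chosen center. Enumerates C(n, k) subsets, so
    it is polynomial only for constant k.
    """
    best_cost, best_centers = None, None
    for centers in combinations(range(len(perms)), k):
        cost = sum(
            min(ulam_distance(p, perms[c]) for c in centers) for p in perms
        )
        if best_cost is None or cost < best_cost:
            best_cost, best_centers = cost, centers
    return [perms[c] for c in best_centers], best_cost


if __name__ == "__main__":
    perms = [[1, 2, 3, 4], [1, 2, 3, 4], [2, 3, 4, 1]]
    print(ulam_distance(perms[0], perms[2]))
    print(best_input_centers(perms, 1))
```

Restricting centers to input points relies only on the triangle inequality, which is why this baseline works for any metric. Improving on its factor of $2$ requires constructing centers that are not among the inputs.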